PPoPP 2019
Sat 16 - Wed 20 February 2019 Washington, DC, United States
Tue 19 Feb 2019 15:45 - 16:10 at Salon 12/13 - Session 8: HPC Chair(s): I-Ting Angelina Lee

Driven by the ever-increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth is a significant burden for fast group communication primitives and also makes those systems more susceptible to failures of individual nodes. In this paper we present a two-phase fault-tolerant scheme for group communication. Using broadcast as an example, we provide a full-spectrum discussion of our approach, from a formal analysis to LogP-based simulations to a message-passing-based implementation running on a large cluster. Ultimately, we are able to reduce the complex problem of reliable and fault-tolerant collective group communication to a graph-theoretical renumbering problem. Both simulations and measurements show that our solution achieves a latency reduction of 50% with up to six times fewer messages sent in comparison to existing schemes.

Tue 19 Feb

15:45 - 16:35
Session 8: HPC, Main Conference, at Salon 12/13
Chair(s): I-Ting Angelina Lee Washington University in St. Louis
15:45
25m
Talk
Corrected Trees for Reliable Group Communication
Main Conference
Martin Küttler TU Dresden, Maksym Planeta TU Dresden, Germany, Jan Bierbaum TU Dresden, Carsten Weinhold TU Dresden, Hermann Härtig TU Dresden, Amnon Barak The Hebrew University of Jerusalem, Torsten Hoefler ETH Zurich
16:10
25m
Talk
Adaptive Sparse Tiling for Sparse Matrix Multiplication
Main Conference
Changwan Hong, Aravind Sukumaran-Rajam Ohio State University, USA, Israt Nisa, Kunal Singh The Ohio State University, P. Sadayappan Ohio State University