PPoPP 2019
Sat 16 - Wed 20 February 2019 Washington, DC, United States
Tue 19 Feb 2019 15:45 - 16:10 at Salon 12/13 - Session 8: HPC Chair(s): I-Ting Angelina Lee

Driven by the ever-increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth is a significant burden for fast group communication primitives and also makes those systems more susceptible to failures of individual nodes. In this paper we present a two-phase fault-tolerant scheme for group communication. Using broadcast as an example, we provide a full-spectrum discussion of our approach, from a formal analysis to LogP-based simulations to a message-passing-based implementation running on a large cluster. Ultimately, we are able to reduce the complex problem of reliable and fault-tolerant collective group communication to a graph-theoretical renumbering problem. Both simulations and measurements show that our solution achieves a latency reduction of 50% with up to six times fewer messages sent than existing schemes.

Tue 19 Feb

15:45 - 16:35: Main Conference - Session 8: HPC at Salon 12/13
Chair(s): I-Ting Angelina Lee (Washington University in St. Louis)
15:45 - 16:10
Talk
Martin Küttler (TU Dresden), Maksym Planeta (TU Dresden, Germany), Jan Bierbaum (TU Dresden), Carsten Weinhold (TU Dresden), Hermann Härtig (TU Dresden), Amnon Barak (The Hebrew University of Jerusalem), Torsten Hoefler (ETH Zurich)
16:10 - 16:35
Talk
Changwan Hong, Aravind Sukumaran-Rajam (Ohio State University, USA), Israt Nisa, Kunal Singh (The Ohio State University), P. Sadayappan (Ohio State University)