Corrected Trees for Reliable Group Communication (PPoPP 2019 - Main Conference)

Who

Martin Küttler, Maksym Planeta, Jan Bierbaum, Carsten Weinhold, Hermann Härtig, Amnon Barak, Torsten Hoefler

Track

PPoPP 2019 Main Conference

When

Tue 19 Feb 2019, 15:45 - 16:10, at Salon 12/13. Session 8: HPC. Chair(s): I-Ting Angelina Lee

Abstract

Driven by the ever-increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth is a significant burden for fast group communication primitives and also makes those systems more susceptible to failures of individual nodes. In this paper, we present a two-phase fault-tolerant scheme for group communication. Using broadcast as an example, we provide a full-spectrum discussion of our approach, from a formal analysis to LogP-based simulations to a message-passing-based implementation running on a large cluster. Ultimately, we are able to reduce the complex problem of reliable and fault-tolerant collective group communication to a graph-theoretical renumbering problem. Both simulations and measurements show that our solution achieves a latency reduction of 50% with up to six times fewer messages sent in comparison to existing schemes.

DOI

https://doi.org/10.1145/3293883.3295721

Martin Küttler, TU Dresden
Maksym Planeta, TU Dresden, Germany
Jan Bierbaum, TU Dresden
Carsten Weinhold, TU Dresden
Hermann Härtig, TU Dresden
Amnon Barak, The Hebrew University of Jerusalem
Torsten Hoefler, ETH Zurich, Switzerland

Session Program

Tue 19 Feb
Displayed time zone: (GMT-05:00) Cancun

15:45 - 16:35	Session 8: HPC (Main Conference) at Salon 12/13. Chair(s): I-Ting Angelina Lee, Washington University in St. Louis

15:45 25m Talk	Corrected Trees for Reliable Group Communication (Main Conference). Martin Küttler (TU Dresden), Maksym Planeta (TU Dresden, Germany), Jan Bierbaum (TU Dresden), Carsten Weinhold (TU Dresden), Hermann Härtig (TU Dresden), Amnon Barak (The Hebrew University of Jerusalem), Torsten Hoefler (ETH Zurich)
16:10 25m Talk	Adaptive Sparse Tiling for Sparse Matrix Multiplication (Main Conference). Changwan Hong, Aravind Sukumaran-Rajam (Ohio State University, USA), Israt Nisa, Kunal Singh (The Ohio State University), P. Sadayappan (Ohio State University)