Optimizing Computation-Communication Overlap in Asynchronous Task-Based Programs
Asynchronous task-based programming models are gaining popularity as a way to address the programmability and performance challenges of high-performance computing. One of the main attractions of these models and their runtimes is their potential to expose and exploit the overlap of computation with communication automatically. However, inefficient interactions between such programming models and the underlying messaging layer (in most cases, MPI) limit the achievable computation-communication overlap and degrade the performance of parallel programs. We propose exposing information about MPI internals to the task-based runtime system so that it can make better scheduling decisions. In particular, we show how existing mechanisms used to profile MPI implementations can be used to share information between MPI and a task-based runtime. An evaluation of the proposed method shows performance improvements of up to 30.7% for applications with collective communication.