Throughput-oriented architectures, such as GPUs, can sustain three orders of magnitude more concurrent threads than multicore architectures. This level of concurrency pushes typical synchronization primitives (e.g., mutexes) beyond their scalability limits, creating significant performance bottlenecks in modules that use them, such as memory allocators. In this paper, we develop concurrent programming techniques and synchronization primitives, in support of a dynamic memory allocator, that remain efficient at very high levels of concurrency.
We formulate resource allocation as a two-stage process that decouples accounting for the number of available resources from tracking the available resources themselves. To facilitate the accounting stage, we introduce a novel bulk semaphore abstraction that extends traditional semaphore semantics by optimizing for the case where many threads operate on the semaphore simultaneously. We similarly design new collective synchronization primitives that enable groups of cooperating threads to enter critical sections together. Finally, we show that delegating deferred reclamation to threads that are already blocked greatly improves efficiency.
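To make the accounting idea concrete, the following is a minimal CPU-side sketch of a bulk semaphore, not the paper's GPU implementation: the class name, method names, and memory-ordering choices are illustrative assumptions. The point it demonstrates is that a group of threads can reserve or return n resource units with a single atomic read-modify-write, rather than performing n individual semaphore operations.

```cpp
#include <atomic>

// Illustrative sketch (names are hypothetical): a counting semaphore
// whose operations acquire or release n units at once, so a cooperating
// group of threads can account for many resources with one atomic step.
class BulkSemaphore {
    std::atomic<long> count_;
public:
    explicit BulkSemaphore(long initial) : count_(initial) {}

    // Try to reserve n units in a single atomic step; returns false if
    // fewer than n units are currently available (caller may retry).
    bool try_acquire(long n) {
        long cur = count_.load(std::memory_order_relaxed);
        while (cur >= n) {
            // On failure, compare_exchange_weak reloads cur and we re-check.
            if (count_.compare_exchange_weak(cur, cur - n,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
                return true;
        }
        return false;
    }

    // Return n units with a single atomic add.
    void release(long n) {
        count_.fetch_add(n, std::memory_order_release);
    }

    long available() const {
        return count_.load(std::memory_order_relaxed);
    }
};
```

On a GPU, the analogous pattern would first combine the requests of a warp or thread block and then issue one such bulk operation on their behalf, which is the aggregation this abstraction is designed to exploit.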
Using these techniques, our throughput-oriented memory allocator delivers both high allocation rates and low memory fragmentation on modern GPUs. Our experiments demonstrate that it achieves allocation rates that are, on average, 16.56 times higher than those of the counterpart implementation in the CUDA 9 toolkit.
Mon 18 Feb (times shown in the Guadalajara, Mexico City, Monterrey time zone)

10:55 - 12:35 | Session 2: Heterogeneous Platforms and GPU (Main Conference, Salon 12/13). Chair: Xu Liu (College of William and Mary)

- 10:55 (25m, Talk): Throughput-Oriented GPU Memory Allocation. Main Conference.
- 11:20 (25m, Talk): SEP-Graph: Finding Shortest Execution Paths for Graph Processing under a Hybrid Framework on GPU. Hao Wang (The Ohio State University, USA), Liang Geng (The Ohio State University, USA), Rubao Lee (United Parallel Computing Corporation, USA), Kaixi Hou (Virginia Tech, USA), Yanfeng Zhang, Xiaodong Zhang (The Ohio State University, USA). Main Conference.
- 11:45 (25m, Talk): Incremental Flattening for Nested Data Parallelism. Troels Henriksen (University of Copenhagen, Denmark), Frederik Thorøe (DIKU, University of Copenhagen), Martin Elsman (University of Copenhagen, Denmark), Cosmin Oancea (University of Copenhagen, Denmark). Main Conference.
- 12:10 (25m, Talk): Adaptive Sparse Matrix-Matrix Multiplication on the GPU. Martin Winter (Graz University of Technology, Austria), Daniel Mlakar (Graz University of Technology, Austria), Rhaleb Zayer (Max Planck Institute for Informatics), Hans-Peter Seidel (Max Planck Institute for Informatics), Markus Steinberger (Graz University of Technology, Austria). Main Conference.