Throughput-oriented architectures, such as GPUs, can sustain three orders of magnitude more concurrent threads than multicore architectures. This level of concurrency pushes typical synchronization primitives (e.g., mutexes) over their scalability limits, creating significant performance bottlenecks in modules, such as memory allocators, that use them. In this paper, we develop concurrent programming techniques and synchronization primitives, in support of a dynamic memory allocator, that are efficient for use with very high levels of concurrency.
We formulate resource allocation as a two-stage process, that decouples accounting for the number of available resources from the tracking of the available resources themselves. To facilitate the accounting stage, we introduce a novel bulk semaphore abstraction that extends traditional semaphore semantics by optimizing for the case where threads operate on the semaphore simultaneously. We also similarly design new collective synchronization primitives that enable groups of cooperating threads to enter critical sections together. Finally, we show that delegation of deferred reclamation to threads already blocked greatly improves efficiency.
Using all these techniques, our throughput-oriented memory allocator delivers both high allocation rates and low memory fragmentation on modern GPUs. Our experiments demonstrate that it achieves allocation rates that are on average 16.56 times higher than the counterpart implementation in the CUDA 9 toolkit.
Mon 18 Feb (GMT-05:00) Guadalajara, Mexico City, Monterrey change
|10:55 - 11:20|
|11:20 - 11:45|
Hao WangThe Ohio State University, USA, Liang GengThe Ohio State University, USA, Rubao LeeUnited Parallel Computing Corporation, USA, Kaixi HouVirginia Tech, USA, Yanfeng Zhang, Xiaodong ZhangThe Ohio State University, USADOI
|11:45 - 12:10|
Troels HenriksenUniversity of Copenhagen, Denmark, Frederik ThorøeDIKU, University of Copenhagen, Martin ElsmanUniversity of Copenhagen, Denmark, Cosmin OanceaUniversity of Copenhagen, DenmarkDOI
|12:10 - 12:35|
Martin WinterGraz University of Technology, Austria, Daniel MlakarGraz University of Technology, Austria, Rhaleb ZayerMax Planck Institute for Informatics, Hans-Peter SeidelMax Planck Institute for Informatics, Markus SteinbergerGraz University of Technology, AustriaDOI