Throughput-oriented architectures, such as GPUs, can sustain three orders of magnitude more concurrent threads than multicore architectures. This level of concurrency pushes typical synchronization primitives (e.g., mutexes) beyond their scalability limits, creating significant performance bottlenecks in modules that use them, such as memory allocators. In this paper, we develop concurrent programming techniques and synchronization primitives, in support of a dynamic memory allocator, that remain efficient at very high levels of concurrency.
We formulate resource allocation as a two-stage process that decouples accounting for the number of available resources from tracking the available resources themselves. To facilitate the accounting stage, we introduce a novel bulk semaphore abstraction that extends traditional semaphore semantics by optimizing for the case where many threads operate on the semaphore simultaneously. We similarly design new collective synchronization primitives that enable groups of cooperating threads to enter critical sections together. Finally, we show that delegating deferred reclamation to threads that are already blocked greatly improves efficiency.
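To make the accounting idea concrete, the following is a minimal CPU-side sketch of a bulk semaphore, not the paper's GPU implementation: the class name, method names, and memory-ordering choices are illustrative assumptions. The point it demonstrates is that a group of threads can reserve or return n resource units with a single atomic read-modify-write, rather than performing n individual semaphore operations.

```cpp
#include <atomic>

// Illustrative sketch (names are hypothetical): a counting semaphore
// whose operations acquire or release n units at once, so a cooperating
// group of threads can account for many resources with one atomic step.
class BulkSemaphore {
    std::atomic<long> count_;
public:
    explicit BulkSemaphore(long initial) : count_(initial) {}

    // Try to reserve n units in a single atomic step; returns false if
    // fewer than n units are currently available (caller may retry).
    bool try_acquire(long n) {
        long cur = count_.load(std::memory_order_relaxed);
        while (cur >= n) {
            // On failure, compare_exchange_weak reloads cur and we re-check.
            if (count_.compare_exchange_weak(cur, cur - n,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
                return true;
        }
        return false;
    }

    // Return n units with a single atomic add.
    void release(long n) {
        count_.fetch_add(n, std::memory_order_release);
    }

    long available() const {
        return count_.load(std::memory_order_relaxed);
    }
};
```

On a GPU, the analogous pattern would first combine the requests of a warp or thread block and then issue one such bulk operation on their behalf, which is the aggregation this abstraction is designed to exploit.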
Using these techniques, our throughput-oriented memory allocator delivers both high allocation rates and low memory fragmentation on modern GPUs. Our experiments demonstrate that it achieves allocation rates that are, on average, 16.56 times higher than those of the counterpart implementation in the CUDA 9 toolkit.
Mon 18 Feb (times shown in the Guadalajara, Mexico City, Monterrey time zone)

10:55 - 12:35 | Session 2: Heterogeneous Platforms and GPU (Main Conference, Salon 12/13). Chair: Xu Liu (College of William and Mary)

- 10:55 (25m, Talk): Throughput-Oriented GPU Memory Allocation. Main Conference.
- 11:20 (25m, Talk): SEP-Graph: Finding Shortest Execution Paths for Graph Processing under a Hybrid Framework on GPU. Hao Wang (The Ohio State University, USA), Liang Geng (The Ohio State University, USA), Rubao Lee (United Parallel Computing Corporation, USA), Kaixi Hou (Virginia Tech, USA), Yanfeng Zhang, Xiaodong Zhang (The Ohio State University, USA). Main Conference.
- 11:45 (25m, Talk): Incremental Flattening for Nested Data Parallelism. Troels Henriksen (University of Copenhagen, Denmark), Frederik Thorøe (DIKU, University of Copenhagen), Martin Elsman (University of Copenhagen, Denmark), Cosmin Oancea (University of Copenhagen, Denmark). Main Conference.
- 12:10 (25m, Talk): Adaptive Sparse Matrix-Matrix Multiplication on the GPU. Martin Winter (Graz University of Technology, Austria), Daniel Mlakar (Graz University of Technology, Austria), Rhaleb Zayer (Max Planck Institute for Informatics), Hans-Peter Seidel (Max Planck Institute for Informatics), Markus Steinberger (Graz University of Technology, Austria). Main Conference.