PPoPP 2019
Sat 16 - Wed 20 February 2019 Washington, DC, United States

GPU utilization, measured as occupancy, is limited by the parallel threads’ combined usage of on-chip resources, such as registers and the programmer-managed shared memory. If the resource demand cannot be met, GPUs reduce the number of concurrent threads, which hurts program performance. Registers are commonly the occupancy-limiting factor, while shared memory tends to be underused; we have observed that registers limit occupancy in about 16% of applications. The approach taken by the nvcc compiler spills excess registers to off-chip local memory. This method ignores shared memory, leaving on-chip resources underutilized. We present a novel compiler technique that reduces register demand by replacing physical registers with shared memory. The cost of the added shared-memory accesses is less than the performance gained from improved occupancy. The technique transforms the GPU assembly code to achieve this effect. The paper also introduces a compile-time performance predictor, which analyzes the assembly code and considers the combined effect of code efficiency, represented by instruction stalls, and occupancy to choose the best code variant. The proposed techniques achieve up to 18% speedup over the nvcc compiler, with a geometric-mean speedup of 9%. Results also show an average gain of 19% over the closest research contribution.
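The trade-off described above can be illustrated with a simplified occupancy model. The sketch below is not the paper's predictor; it is a minimal calculation, assuming hypothetical per-SM resource limits (65,536 registers, 96 KiB shared memory, 2,048 resident threads, roughly matching a modern NVIDIA SM), that shows how moving some per-thread registers into shared memory can raise the number of resident thread blocks:

```python
# Hypothetical per-SM resource limits (illustrative only).
REGS_PER_SM = 65536          # 32-bit registers per SM
SMEM_PER_SM = 96 * 1024      # shared memory per SM, in bytes
MAX_THREADS_PER_SM = 2048    # maximum resident threads per SM

def occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Fraction of the SM's maximum resident threads that can run
    concurrently, given a kernel's per-thread register count and
    per-block shared-memory usage."""
    blocks_by_regs = REGS_PER_SM // (regs_per_thread * threads_per_block)
    blocks_by_smem = (SMEM_PER_SM // smem_per_block
                      if smem_per_block > 0 else float('inf'))
    blocks_by_threads = MAX_THREADS_PER_SM // threads_per_block
    blocks = min(blocks_by_regs, blocks_by_smem, blocks_by_threads)
    return blocks * threads_per_block / MAX_THREADS_PER_SM

# A kernel using 64 registers/thread and no shared memory is
# register-limited: only 4 blocks of 256 threads fit per SM.
print(occupancy(256, 64, 0))                  # 0.5

# Moving 16 registers per thread into shared memory (16 words
# * 4 bytes * 256 threads = 16 KiB per block) lifts the register
# limit to 5 blocks before the shared-memory limit kicks in.
print(occupancy(256, 48, 16 * 4 * 256))       # 0.625
```

Under these assumed limits, trading 16 registers per thread for 16 KiB of shared memory per block raises occupancy from 50% to 62.5%; whether that pays off depends on the cost of the added shared-memory accesses, which is what the paper's predictor weighs against instruction stalls.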