A Coordinated Tiling and Batching Framework for Efficient GEMM on GPUs (PPoPP 2019 - Main Conference)

Who

Xiuhong Li, Eric Liang, Shengen Yan, Jia Liancheng, Yinghan Li

Track

PPoPP 2019 Main Conference

Time Zone

The program is currently displayed in (GMT-05:00) Guadalajara, Mexico City, Monterrey.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 19 Feb 2019 12:10 - 12:35 at Salon 12/13 - Session 6, Best Paper Candidates Chair(s): Rudolf Eigenmann

Abstract

General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the parallelism within and between tiles. The tiling hierarchy closely mirrors the thread hierarchy on GPUs. In practice, GPUs can fully unleash their computing power when the matrix size is large and there are a sufficient number of tiles and a sufficient workload for each tile. However, in many real-world applications, especially in deep learning, the matrix size is small. To address this, prior work proposes batched-GEMMs, which process a group of small independent GEMMs together by designing a single CUDA kernel for all of these GEMMs.
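The tile-count argument above can be made concrete with a minimal Python sketch. This is not code from the paper: the function name `count_tiles`, the matrix sizes, and the tile sizes are all assumptions chosen for illustration. If each output tile maps to one thread block, a small matrix produces far fewer tiles than a large one, while a batch of small GEMMs with smaller tiles recovers some of that parallelism.

```python
import math

def count_tiles(m, n, tile_m, tile_n):
    """Number of output tiles when an m x n result matrix is cut into
    tile_m x tile_n tiles (tiles on the edge may be partial)."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)

# Hypothetical sizes, chosen only for illustration.
large = count_tiles(4096, 4096, 128, 128)            # 32 * 32 tiles
small = count_tiles(64, 64, 128, 128)                # one partial tile
batch_of_small = 100 * count_tiles(64, 64, 32, 32)   # 100 GEMMs, 4 tiles each

print(large, small, batch_of_small)
```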

However, the current support for batched-GEMMs is still rudimentary. Tiling and batching are tightly correlated. A large tile size can exploit data reuse, but it decreases the thread-level parallelism, which in turn shrinks the optimization space for batching. A small tile size increases the thread-level parallelism and thus provides a larger optimization space for batching, but at the cost of sacrificing data reuse. In this paper, we propose a coordinated tiling and batching framework for accelerating GEMM on GPUs. It is a two-phase framework, which consists of a tiling engine and a batching engine to perform efficient batched-GEMMs on GPUs. The tiling engine partitions the GEMMs into independent tiles, and the batching engine assigns the tiles to thread blocks. Moreover, we propose a general programming interface to describe the coordinated tiling and batching solution. Finally, experimental results on synthetic batched GEMM cases show that our framework achieves about 1.40X speedup on average over the state-of-the-art technique. We also use GoogleNet as a real-world case study, and our framework achieves 1.23X speedup.
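The two-phase structure described in the abstract can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `Tile`, `tile_gemms`, `batch_tiles`, and `run_blocks` are hypothetical, a single fixed tile size is used for all GEMMs, and the batching policy is a simple greedy "least-loaded block first" heuristic standing in for whatever policy the paper's batching engine actually uses. Thread blocks are simulated as Python lists that are processed one after another.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tile:
    gemm_id: int   # which GEMM in the batch this tile belongs to
    row0: int      # first output row covered by the tile
    col0: int      # first output column covered by the tile
    rows: int      # tile height (edge tiles may be smaller)
    cols: int      # tile width

def tile_gemms(shapes, tile_m, tile_n):
    """Tiling phase: cut every (M, N, K) GEMM into independent output tiles."""
    tiles = []
    for gemm_id, (m, n, _k) in enumerate(shapes):
        for r in range(0, m, tile_m):
            for c in range(0, n, tile_n):
                tiles.append(Tile(gemm_id, r, c,
                                  min(tile_m, m - r), min(tile_n, n - c)))
    return tiles

def batch_tiles(tiles, shapes, num_blocks):
    """Batching phase (assumed greedy policy): give each tile, largest
    first, to the simulated thread block with the least work so far."""
    def work(t):
        return t.rows * t.cols * shapes[t.gemm_id][2]  # multiply-adds
    blocks = [[] for _ in range(num_blocks)]
    loads = [0] * num_blocks
    for t in sorted(tiles, key=work, reverse=True):
        b = loads.index(min(loads))
        blocks[b].append(t)
        loads[b] += work(t)
    return blocks, loads

def run_blocks(blocks, a_mats, b_mats, shapes):
    """Compute every tile; on a GPU the blocks would run concurrently."""
    c_mats = [[[0] * n for _ in range(m)] for (m, n, _k) in shapes]
    for block in blocks:
        for t in block:
            a, b = a_mats[t.gemm_id], b_mats[t.gemm_id]
            k = shapes[t.gemm_id][2]
            for i in range(t.row0, t.row0 + t.rows):
                for j in range(t.col0, t.col0 + t.cols):
                    c_mats[t.gemm_id][i][j] = sum(
                        a[i][p] * b[p][j] for p in range(k))
    return c_mats

# Two small GEMMs with made-up shapes (M, N, K) and made-up integer data.
shapes = [(4, 6, 3), (5, 2, 7)]
a_mats = [[[i + p for p in range(k)] for i in range(m)] for (m, _n, k) in shapes]
b_mats = [[[p - j for j in range(n)] for p in range(k)] for (_m, n, k) in shapes]

tiles = tile_gemms(shapes, tile_m=2, tile_n=4)
blocks, loads = batch_tiles(tiles, shapes, num_blocks=3)
c_mats = run_blocks(blocks, a_mats, b_mats, shapes)
```

Changing `tile_m` and `tile_n` in this sketch shows the trade-off the abstract describes: smaller tiles give the batching phase more units to distribute across blocks, while larger tiles mean each block reuses more of the input rows and columns it loads.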

DOI

https://doi.org/10.1145/3293883.3295734

Xiuhong Li

Peking University

Eric Liang

Peking University

Shengen Yan

SenseTime

Jia Liancheng

Peking University

Yinghan Li

SenseTime

Session Program

Tue 19 Feb
Displayed time zone: Guadalajara, Mexico City, Monterrey

10:55 - 12:35	Session 6, Best Paper Candidates (Main Conference) at Salon 12/13. Chair(s): Rudolf Eigenmann, University of Delaware

10:55 25m Talk		Lightweight Hardware Transactional Memory Profiling. Main Conference. Qingsen Wang (College of William and Mary), Pengfei Su (College of William and Mary), Milind Chabbi (Uber Technologies), Xu Liu (College of William and Mary)
11:20 25m Talk		A Pattern Based Algorithmic Autotuner for Graph Processing on GPUs. Main Conference. Ke Meng, Jiajia Li (Georgia Institute of Technology, Pacific Northwest National Laboratory), Guangming Tan (Chinese Academy of Sciences (CAS)), Ninghui Sun (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences)
11:45 25m Talk		Provably and Practically Efficient Granularity Control. Main Conference. Umut A. Acar (Carnegie Mellon University), Vitaly Aksenov (Inria & ITMO University), Arthur Charguéraud (Inria), Mike Rainey (Indiana University, USA)
12:10 25m Talk		A Coordinated Tiling and Batching Framework for Efficient GEMM on GPUs. Main Conference. Xiuhong Li (Peking University), Eric Liang (Peking University), Shengen Yan (SenseTime), Jia Liancheng (Peking University), Yinghan Li (SenseTime)