Deep learning has proven highly effective on text, image, and time-series data. As these models have become more capable, the demands they place on hardware and software have grown as well.
increased as well. Nvidia GPUs are the main hardware type that is used to both
train and serve DL models of various sizes. It comes with high performance linear
algebra (cuBLAS), deep neural networks (cuDNN) libraries and inference engines
(TensorRT) which are used to accelerate computations. In addition to these, CUDA
parallel programming software allows users to devise their own custom kernels for
specific cases. This work consists of two parts. In the first part, three different im-
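For instance, a common case for a hand-written kernel in CNN inference is fusing a per-channel bias add with a ReLU activation; the following is a minimal illustrative sketch (names and data layout are hypothetical, not taken from the evaluated implementations):

    // Fused bias-add + ReLU over a channels-last tensor; one thread per element.
    // Illustrative only: names and layout are assumptions, not the thesis code.
    #include <cuda_runtime.h>

    __global__ void bias_relu(float* x, const float* bias, int n, int c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i] + bias[i % c];   // per-channel bias (channels-last)
            x[i] = v > 0.0f ? v : 0.0f;     // ReLU
        }
    }

    // Launch with one thread per element, e.g.:
    // bias_relu<<<(n + 255) / 256, 256>>>(d_x, d_bias, n, c);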
This work consists of two parts. In the first part, three implementations of YOLO, a convolutional neural network model, built on cuBLAS, cuDNN, and TensorRT respectively, were evaluated. By collecting GPU performance metrics such as compute utilization and memory throughput, the metrics that most strongly affect the performance of kernels from these libraries were identified.
In the second part, we discussed the attention mechanism from the Transformer architecture. Standard attention is bottlenecked by memory bandwidth, since the intermediate kernels must read their inputs from and write their results to global memory. FlashAttention-2 (FA2) addressed this issue by fusing all of these kernels into one using the CUTLASS library, which improved the efficiency of the attention operation by orders of magnitude.
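For reference, the unfused formulation that FA2 fuses computes, for query, key, and value matrices Q, K, and V with sequence length N and head dimension d,

\[
S = \frac{Q K^\top}{\sqrt{d}}, \qquad P = \operatorname{softmax}(S), \qquad O = P V,
\]

where the N-by-N intermediates S and P are written to and re-read from global memory between kernels; the fused kernel instead keeps these tiles in on-chip memory.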
This work used the Tensor Core WMMA API to implement a similar fused CUDA kernel and explored potential improvements by selecting suitable tile sizes for Q, K, and V.
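A minimal sketch of the warp-level Tensor Core primitive at the heart of such a kernel is shown below: one warp computes a 16-by-16 tile of S = QK^T with the WMMA API. The 16x16x16 fragment shape and fp16 inputs are illustrative defaults; the tuned tile sizes explored in this work are not reproduced here.

    // One warp computes a 16x16 tile of S = Q * K^T; launch with 32 threads.
    // Fragment shapes and layouts are illustrative, not the tuned configuration.
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void qk_tile(const half* Q, const half* K, float* S,
                            int d /* head dim, multiple of 16 */, int ldS) {
        // Q is row-major; loading K through a col-major fragment yields K^T.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> q_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> k_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> s_frag;

        wmma::fill_fragment(s_frag, 0.0f);
        for (int k = 0; k < d; k += 16) {       // accumulate over the head dim
            wmma::load_matrix_sync(q_frag, Q + k, d);
            wmma::load_matrix_sync(k_frag, K + k, d);
            wmma::mma_sync(s_frag, q_frag, k_frag, s_frag);
        }
        wmma::store_matrix_sync(S, s_frag, ldS, wmma::mem_row_major);
    }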
As a result, the latency of the FA2 kernel was improved by 10% and 40% on A100 and RTX 3060 GPUs, respectively.
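Latency figures of this kind are typically obtained with CUDA event timers; a minimal sketch follows (the exact measurement harness of this work is not described in this section):

    // Time a single kernel launch with CUDA events; illustrative harness only.
    #include <cuda_runtime.h>

    float time_kernel_ms(void (*launch)()) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        launch();                    // launch the kernel under test
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait for the kernel to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }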