dc.contributor.author | Nazir, Zhumakhan | |
dc.date.accessioned | 2024-05-20T15:04:30Z | |
dc.date.available | 2024-05-20T15:04:30Z | |
dc.date.issued | 2024-04-22 | |
dc.identifier.citation | Nazir, Zhumakhan. (2024) CNN performance analysis with different GPU libraries and Attention optimization on GPU using Tensor Core WMMA API. Nazarbayev University School of Engineering and Digital Sciences | en_US |
dc.identifier.uri | http://nur.nu.edu.kz/handle/123456789/7704 | |
dc.description.abstract | Deep Learning has been very effective in tasks involving text, image, or time-series data. With this increased effectiveness, demand for hardware and software capability has grown as well. Nvidia GPUs are the main hardware used both to train and to serve DL models of various sizes. Nvidia provides high-performance libraries for linear algebra (cuBLAS) and deep neural networks (cuDNN), as well as an inference engine (TensorRT), all used to accelerate computation. In addition, the CUDA parallel programming platform allows users to devise custom kernels for specific cases. This work consists of two parts. In the first part, three different implementations of Yolo, a convolutional neural network model, built on cuBLAS, cuDNN, and TensorRT were evaluated. By collecting GPU performance metrics such as compute utilization and memory throughput, the metrics that most strongly affect the performance of kernels from these libraries were identified. In the second part, we discuss the attention mechanism from the Transformer architecture. The standard attention mechanism is bottlenecked by memory bandwidth, since intermediate kernels must read from and write to global memory. FlashAttention2 addressed this issue by fusing all kernels into one using the CUTLASS library, improving the efficiency of the attention operation by several orders of magnitude. This work used the Tensor Core WMMA API to implement a similar CUDA kernel and explored potential improvements by selecting proper Q, K, and V tile sizes. As a result, the latency of the FA2 kernel was improved by 10% and 40% on A100 and RTX3060 GPUs, respectively. | en_US |
dc.language.iso | en | en_US |
dc.publisher | Nazarbayev University School of Engineering and Digital Sciences | en_US |
dc.rights | Attribution 3.0 United States | * |
dc.rights.uri | http://creativecommons.org/licenses/by/3.0/us/ | * |
dc.subject | Type of access: Open access | en_US |
dc.subject | Deep Learning | en_US |
dc.subject | GPU | en_US |
dc.subject | WMMA | en_US |
dc.title | CNN PERFORMANCE ANALYSIS WITH DIFFERENT GPU LIBRARIES AND ATTENTION OPTIMIZATION ON GPU USING TENSOR CORE WMMA API | en_US |
dc.type | Master's thesis | en_US |
workflow.import.source | science |
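The memory-bandwidth bottleneck described in the abstract comes from the unfused attention computation materializing full N×N intermediates between kernels. A minimal NumPy sketch (illustrative only, not code from the thesis) of that standard, unfused formulation:

```python
import numpy as np

def standard_attention(Q, K, V):
    """Standard (unfused) scaled dot-product attention.

    The full N x N score matrix S and softmax matrix P are materialized.
    On a GPU, each such intermediate is written to and re-read from global
    memory between kernels -- the bandwidth bottleneck that fused
    FlashAttention-style kernels avoid by tiling Q, K, and V on-chip.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                      # (N, N) scores: round trip 1
    P = np.exp(S - S.max(axis=-1, keepdims=True)) # numerically stable softmax
    P /= P.sum(axis=-1, keepdims=True)            # (N, N) probabilities: round trip 2
    return P @ V                                  # (N, d) output: round trip 3

# Tiny example: 4 queries/keys/values of head dimension 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
O = standard_attention(Q, K, V)
```

A fused kernel computes the same output while keeping Q, K, V tiles in shared memory or registers, which is why the choice of tile sizes studied in the thesis matters.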