DSpace Repository

CNN PERFORMANCE ANALYSIS WITH DIFFERENT GPU LIBRARIES AND ATTENTION OPTIMIZATION ON GPU USING TENSOR CORE WMMA API


Show simple item record

dc.contributor.author Nazir, Zhumakhan
dc.date.accessioned 2024-05-20T15:04:30Z
dc.date.available 2024-05-20T15:04:30Z
dc.date.issued 2024-04-22
dc.identifier.citation Nazir, Zhumakhan. (2024) CNN performance analysis with different GPU libraries and Attention optimization on GPU using Tensor Core WMMA API. Nazarbayev University School of Engineering and Digital Sciences en_US
dc.identifier.uri http://nur.nu.edu.kz/handle/123456789/7704
dc.description.abstract Deep Learning has been very effective in tasks involving text, images, and time series data. With this increased effectiveness, demand for hardware and software capabilities has grown as well. Nvidia GPUs are the main hardware used to both train and serve DL models of various sizes. They come with high-performance libraries for linear algebra (cuBLAS) and deep neural networks (cuDNN), as well as an inference engine (TensorRT), all used to accelerate computation. In addition, the CUDA parallel programming platform allows users to write custom kernels for specific cases. This work consists of two parts. In the first part, three implementations of YOLO, a convolutional neural network model, built on cuBLAS, cuDNN, and TensorRT were evaluated. By collecting GPU performance metrics such as compute utilization and memory throughput, the metrics that most strongly affect the performance of kernels from these libraries were identified. The second part examines the attention mechanism of the Transformer architecture. The standard attention implementation is bottlenecked by memory bandwidth, since intermediate kernels must read from and write to global memory. FlashAttention2 addressed this issue by fusing all kernels into one using the CUTLASS library, improving the efficiency of the attention operation by several orders of magnitude. This work used the Tensor Core WMMA API to implement a similar CUDA kernel and explored potential improvements by selecting appropriate Q, K, and V tile sizes. As a result, the latency of the FA2 kernel was improved by 10% and 40% on A100 and RTX 3060 GPUs, respectively. en_US
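The Tensor Core WMMA API named in the abstract exposes warp-level matrix-multiply-accumulate primitives; a fused attention kernel is built from many such tile operations (e.g. for the Q·Kᵀ and softmax·V products). A minimal sketch of the primitive, assuming a single 16×16×16 half-precision tile per warp (this fragment is illustrative, not the thesis's actual kernel):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of C = A * B using Tensor Cores.
// Attention kernels like FlashAttention2 tile Q, K, and V and chain many
// such MMA steps inside one fused kernel to avoid global-memory round trips.
__global__ void wmma_tile_gemm(const half *a, const half *b, float *c) {
    // Per-warp fragments for a 16x16x16 matrix-multiply-accumulate.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);       // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);   // load A tile (leading dim 16)
    wmma::load_matrix_sync(b_frag, b, 16);   // load B tile (leading dim 16)
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core MMA
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

The tile-size choices the abstract refers to correspond to how many such fragments each warp and thread block process per Q, K, and V block, which trades register and shared-memory pressure against occupancy.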
dc.language.iso en en_US
dc.publisher Nazarbayev University School of Engineering and Digital Sciences en_US
dc.rights Attribution 3.0 United States
dc.rights.uri http://creativecommons.org/licenses/by/3.0/us/
dc.subject Type of access: Open access en_US
dc.subject Deep Learning en_US
dc.subject GPU en_US
dc.subject WMMA en_US
dc.title CNN PERFORMANCE ANALYSIS WITH DIFFERENT GPU LIBRARIES AND ATTENTION OPTIMIZATION ON GPU USING TENSOR CORE WMMA API en_US
dc.type Master's thesis en_US
workflow.import.source science




Except where otherwise noted, this item's license is described as Attribution 3.0 United States.