CNN PERFORMANCE ANALYSIS WITH DIFFERENT GPU LIBRARIES AND ATTENTION OPTIMIZATION ON GPU USING TENSOR CORE WMMA API

dc.contributor.authorNazir, Zhumakhan
dc.date.accessioned2024-05-20T15:04:30Z
dc.date.available2024-05-20T15:04:30Z
dc.date.issued2024-04-22
dc.description.abstractDeep Learning has been very effective in tasks involving text, images, and time-series data. With this increased effectiveness, demand for hardware and software capabilities has grown as well. Nvidia GPUs are the main hardware used to both train and serve DL models of various sizes. They come with high-performance libraries for linear algebra (cuBLAS) and deep neural networks (cuDNN), as well as an inference engine (TensorRT), all of which are used to accelerate computations. In addition, the CUDA parallel programming platform allows users to write custom kernels for specific cases. This work consists of two parts. In the first part, three implementations of Yolo, a convolutional neural network model, built on cuBLAS, cuDNN, and TensorRT were evaluated. By collecting GPU performance metrics such as compute utilization and memory throughput, the metrics that most strongly affect the performance of kernels from these libraries were identified. In the second part, we discuss the attention mechanism from the Transformer architecture. The standard attention implementation is bottlenecked by memory bandwidth, since intermediate kernels must read from and write to global memory. FlashAttention2 addressed this issue by fusing all kernels into one using the CUTLASS library, improving the efficiency of the attention operation severalfold. This work used the Tensor Core WMMA API to implement a similar CUDA kernel and explored potential improvements by selecting proper Q, K, and V tile sizes. As a result, the latency of the FA2 kernel was improved by 10% and 40% on A100 and RTX 3060 GPUs, respectively.en_US
dc.identifier.citationNazir, Zhumakhan. (2024) CNN performance analysis with different GPU libraries and Attention optimization on GPU using Tensor Core WMMA API. Nazarbayev University School of Engineering and Digital Sciencesen_US
dc.identifier.urihttp://nur.nu.edu.kz/handle/123456789/7704
dc.language.isoenen_US
dc.publisherNazarbayev University School of Engineering and Digital Sciencesen_US
dc.rightsAttribution 3.0 United States*
dc.rights.urihttp://creativecommons.org/licenses/by/3.0/us/*
dc.subjecttype of access: open accessen_US
dc.subjectDeep Learningen_US
dc.subjectGPUen_US
dc.subjectWMMAen_US
dc.titleCNN PERFORMANCE ANALYSIS WITH DIFFERENT GPU LIBRARIES AND ATTENTION OPTIMIZATION ON GPU USING TENSOR CORE WMMA APIen_US
dc.typeMaster's thesisen_US
workflow.import.sourcescience
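
Note on the Tensor Core WMMA API mentioned in the abstract: the sketch below shows a single warp-level 16x16x16 half-precision matrix-multiply-accumulate, the basic building block from which fused attention tiles (the Q·K^T and P·V products) can be composed. The kernel name, tile sizes, and memory layouts here are illustrative assumptions for exposition, not the implementation described in the thesis.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 output tile: C = A (16x16, half) * B (16x16, half), accumulated in fp32.
__global__ void wmma_tile_example(const half *A, const half *B, float *C,
                                  int lda, int ldb, int ldc) {
    // Fragments are warp-wide register storage for Tensor Core operands.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, lda);          // load a 16x16 tile of A
    wmma::load_matrix_sync(b_frag, B, ldb);          // load a 16x16 tile of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);  // write result tile
}

In a fused attention kernel along the lines of FlashAttention2, many such tile multiplies are chained over Q, K, and V blocks held in shared memory, with the running softmax state kept in registers, so intermediate results never round-trip through global memory; the choice of Q, K, and V tile sizes trades register and shared-memory pressure against occupancy, which is the tuning space the abstract refers to.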

Files

Original bundle

Name: Thesis_Zhumakhan_Nazir.pdf
Size: 1012.6 KB
Format: Adobe Portable Document Format
Description: Thesis