Deep learning has proven highly effective on text, image, and time-series data. As these models have become more capable, the demands they place on hardware and software have grown as well.
increased as well. Nvidia GPUs are the main hardware type that is used to both
train and serve DL models of various sizes. It comes with high performance linear
algebra (cuBLAS), deep neural networks (cuDNN) libraries and inference engines
(TensorRT) which are used to accelerate computations. In addition to these, CUDA
parallel programming software allows users to devise their own custom kernels for
specific cases. This work consists of two parts. In the first part, three different im-
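For instance, a common case for a hand-written kernel in CNN inference is fusing a per-channel bias add with a ReLU activation; the following is a minimal illustrative sketch (names and data layout are hypothetical, not taken from the evaluated implementations):

    // Fused bias-add + ReLU over a channels-last tensor; one thread per element.
    // Illustrative only: names and layout are assumptions, not the thesis code.
    #include <cuda_runtime.h>

    __global__ void bias_relu(float* x, const float* bias, int n, int c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i] + bias[i % c];   // per-channel bias (channels-last)
            x[i] = v > 0.0f ? v : 0.0f;     // ReLU
        }
    }

    // Launch with one thread per element, e.g.:
    // bias_relu<<<(n + 255) / 256, 256>>>(d_x, d_bias, n, c);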
This work consists of two parts. In the first part, three implementations of YOLO, a convolutional neural network model, built on cuBLAS, cuDNN, and TensorRT respectively, were evaluated. By collecting GPU performance metrics such as compute utilization and memory throughput, the metrics that most strongly affect the performance of kernels from these libraries were identified.
In the second part, we discussed the attention mechanism from the Transformer architecture. Standard attention is bottlenecked by memory bandwidth, since the intermediate kernels must read their inputs from and write their results to global memory. FlashAttention-2 (FA2) addressed this issue by fusing all of these kernels into one using the CUTLASS library, which improved the efficiency of the attention operation by orders of magnitude.
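For reference, the unfused formulation that FA2 fuses computes, for query, key, and value matrices Q, K, and V with sequence length N and head dimension d,

\[
S = \frac{Q K^\top}{\sqrt{d}}, \qquad P = \operatorname{softmax}(S), \qquad O = P V,
\]

where the N-by-N intermediates S and P are written to and re-read from global memory between kernels; the fused kernel instead keeps these tiles in on-chip memory.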
This work used the Tensor Core WMMA API to implement a similar fused CUDA kernel and explored potential improvements by selecting suitable tile sizes for Q, K, and V.
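A minimal sketch of the warp-level Tensor Core primitive at the heart of such a kernel is shown below: one warp computes a 16-by-16 tile of S = QK^T with the WMMA API. The 16x16x16 fragment shape and fp16 inputs are illustrative defaults; the tuned tile sizes explored in this work are not reproduced here.

    // One warp computes a 16x16 tile of S = Q * K^T; launch with 32 threads.
    // Fragment shapes and layouts are illustrative, not the tuned configuration.
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void qk_tile(const half* Q, const half* K, float* S,
                            int d /* head dim, multiple of 16 */, int ldS) {
        // Q is row-major; loading K through a col-major fragment yields K^T.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> q_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> k_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> s_frag;

        wmma::fill_fragment(s_frag, 0.0f);
        for (int k = 0; k < d; k += 16) {       // accumulate over the head dim
            wmma::load_matrix_sync(q_frag, Q + k, d);
            wmma::load_matrix_sync(k_frag, K + k, d);
            wmma::mma_sync(s_frag, q_frag, k_frag, s_frag);
        }
        wmma::store_matrix_sync(S, s_frag, ldS, wmma::mem_row_major);
    }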
As a result, the latency of the FA2 kernel was improved by 10% and 40% on A100 and RTX 3060 GPUs, respectively.
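Latency figures of this kind are typically obtained with CUDA event timers; a minimal sketch follows (the exact measurement harness of this work is not described in this section):

    // Time a single kernel launch with CUDA events; illustrative harness only.
    #include <cuda_runtime.h>

    float time_kernel_ms(void (*launch)()) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        launch();                    // launch the kernel under test
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait for the kernel to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }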