Abstract:
Emotion plays a pivotal role in human communication, facilitating mutual understanding among individuals. In human-computer interaction, accurate computer-based speech emotion recognition (SER) therefore holds great importance. SER is a complex and challenging task within data science, with diverse applications across various fields. This thesis delves into the implementation and optimization of Transformer mechanisms for SER, aiming to develop a robust, real-time SER system capable of multilabel classification. To achieve this objective, efficient preprocessing and fusion methodologies are designed to handle real-life speech data effectively. Moreover, a novel SER architecture is proposed that leverages large Automatic Speech Recognition (ASR) models. The contributions of this research comprise the creation of a robust Transformer-based multimodal model incorporating a feature extraction mechanism based on Convolutional Neural Networks (CNNs). The resulting system achieves real-time SER with high accuracy. Extensive evaluations were conducted on well-recognized datasets. Our Fine-tuned Spectrogram-based SER Model yielded an accuracy of 62.43% on the RAVDESS dataset. The Encoder-based SER Model achieved accuracies of 73.11%, 68.56%, and 83.74% on the RAVDESS, IEMOCAP, and EMO-DB datasets, respectively. Remarkably, our ASR-based SER Model demonstrated the highest accuracies: 77.41%, 79.22%, and 90.60% on the same datasets. Furthermore, our Multimodal Model reached an impressive accuracy of 81.26% on the IEMOCAP dataset. Our models consistently outperform those of previous studies on all datasets, substantiating the effectiveness of our methodologies and underlining their potential to advance the field of emotion recognition. The thesis concludes by discussing the primary challenges and issues encountered by current state-of-the-art models, offering valuable insights for future research in this domain.