DEEP TRANSFORMER NEURAL NETWORKS FOR REAL-TIME SPEECH EMOTION RECOGNITION: A MULTIMODAL APPROACH

dc.contributor.author: Akkassov, Ayan
dc.date.accessioned: 2024-06-23T19:14:30Z
dc.date.available: 2024-06-23T19:14:30Z
dc.date.issued: 2023-07-27
dc.description.abstract: Emotion plays a pivotal role in human communication, facilitating mutual understanding among individuals. In the context of human-computer interaction, accurate computer-based speech emotion recognition (SER) is of great importance. SER is a complex and challenging task within data science, given its diverse applications across various fields. This thesis delves into the implementation and optimization of Transformer mechanisms for SER, aiming to develop a robust, real-time SER system capable of handling multilabel classification. To achieve this objective, efficient preprocessing and fusion methodologies are designed to handle real-life speech data effectively. Moreover, a novel SER architecture is proposed that leverages large Automatic Speech Recognition (ASR) models. The contributions of this research comprise the creation of a robust Transformer-based multimodal model incorporating a feature extraction mechanism based on Convolutional Neural Networks (CNNs). The resulting system achieves real-time SER with high accuracy. Extensive evaluations were conducted on well-recognized datasets. Our fine-tuned spectrogram-based SER model yielded an accuracy of 62.43% on the RAVDESS dataset. The encoder-based SER model achieved accuracies of 73.11%, 68.56%, and 83.74% on the RAVDESS, IEMOCAP, and EMO-DB datasets, respectively. Remarkably, our ASR-based SER model demonstrated the highest accuracies of 77.41%, 79.22%, and 90.60% on the same datasets. Furthermore, our multimodal model reached an accuracy of 81.26% on the IEMOCAP dataset. Our models consistently outperform previous studies on all datasets, substantiating the effectiveness of our methodologies and underlining their potential to advance the field of emotion recognition. The thesis concludes by discussing the primary challenges encountered by current state-of-the-art models, offering valuable insights for future research in this domain.
dc.identifier.citation: Akkassov, A. (2023). Deep Transformer Neural Networks for Real-time Speech Emotion Recognition: A Multimodal Approach. Nazarbayev University School of Engineering and Digital Sciences
dc.identifier.uri: http://nur.nu.edu.kz/handle/123456789/7968
dc.language.iso: en
dc.publisher: Nazarbayev University School of Engineering and Digital Sciences
dc.rights: Attribution-NoDerivs 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nd/3.0/us/
dc.subject: SER
dc.subject: Deep Learning
dc.subject: Transformer
dc.subject: type of access: restricted access
dc.title: DEEP TRANSFORMER NEURAL NETWORKS FOR REAL-TIME SPEECH EMOTION RECOGNITION: A MULTIMODAL APPROACH
dc.type: Master's thesis
workflow.import.source: science
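The CNN-plus-Transformer-encoder architecture described in the abstract can be sketched as follows. This is a minimal illustration only, not the thesis's actual implementation: all layer sizes, the eight-class output (matching RAVDESS's emotion set), and the input spectrogram shape are assumptions.

```python
import torch
import torch.nn as nn

class SERModel(nn.Module):
    """Hypothetical sketch: CNN feature extractor over mel-spectrogram
    frames, followed by a Transformer encoder and a classification head."""

    def __init__(self, n_mels=64, d_model=128, n_heads=4, n_layers=2, n_classes=8):
        super().__init__()
        # CNN front-end: treat mel bins as input channels, convolve over time
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, spec):             # spec: (batch, n_mels, time)
        x = self.cnn(spec)               # (batch, d_model, time)
        x = x.transpose(1, 2)            # (batch, time, d_model)
        x = self.encoder(x)              # self-attention across time frames
        return self.head(x.mean(dim=1))  # mean-pool over time -> class logits

model = SERModel()
spec = torch.randn(2, 64, 100)           # two dummy 64-bin spectrograms
logits = model(spec)
print(logits.shape)                      # torch.Size([2, 8])
```

In practice the input would be a mel-spectrogram computed from the waveform (e.g. via `torchaudio.transforms.MelSpectrogram`), and the multimodal variant described in the abstract would fuse these acoustic features with ASR-derived representations before classification.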

Files

Original bundle

Name: Ayan_Akkassov_Thesis.pdf
Size: 2.18 MB
Format: Adobe Portable Document Format
Description: Master's thesis