
DEEP TRANSFORMER NEURAL NETWORKS FOR REAL-TIME SPEECH EMOTION RECOGNITION: A MULTIMODAL APPROACH

dc.contributor.author Akkassov, Ayan
dc.date.accessioned 2024-06-23T19:14:30Z
dc.date.available 2024-06-23T19:14:30Z
dc.date.issued 2023-07-27
dc.identifier.citation Akkassov, A. (2023). Deep Transformer Neural Networks for Real-time Speech Emotion Recognition: A Multimodal Approach. Nazarbayev University School of Engineering and Digital Sciences en_US
dc.identifier.uri http://nur.nu.edu.kz/handle/123456789/7968
dc.description.abstract Emotion plays a pivotal role in human communication, facilitating mutual understanding among individuals. In the context of human-computer interaction, accurate computer-based speech emotion recognition (SER) is therefore of great importance. SER is a complex and challenging task in data science, with diverse applications across many fields. This thesis delves into the implementation and optimization of Transformer mechanisms for SER, aiming to develop a robust, real-time SER system capable of handling multilabel classification. To achieve this objective, efficient preprocessing and fusion methodologies are designed to handle real-life speech data effectively. Moreover, a novel SER architecture is proposed, which leverages large Automatic Speech Recognition (ASR) models. The contributions of this research comprise the creation of a robust Transformer-based multimodal model incorporating a feature extraction mechanism based on Convolutional Neural Networks (CNNs). The resulting system achieves real-time SER with high accuracy. Extensive evaluations were conducted on well-recognized datasets. Our Finetuned Spectrogram-based SER Model yielded an accuracy of 62.43% on the RAVDESS dataset. The Encoder-based SER Model achieved accuracies of 73.11%, 68.56%, and 83.74% on the RAVDESS, IEMOCAP, and EMO-DB datasets, respectively. Remarkably, our ASR-based SER Model demonstrated the highest accuracies of 77.41%, 79.22%, and 90.60% on the same datasets. Furthermore, our Multimodal Model reached an impressive accuracy of 81.26% on the IEMOCAP dataset. Our models consistently outperform those reported in previous studies on all datasets, substantiating the effectiveness of our methodologies and underlining their potential for advancing the field of emotion recognition. The thesis concludes by discussing the primary challenges and issues encountered by current state-of-the-art models, offering valuable insights for future research in this domain. en_US
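
The full text of the thesis is restricted, so as an illustration only, the following is a minimal sketch of how a CNN-based feature extractor might feed a Transformer encoder for utterance-level SER, in the spirit of the architecture the abstract describes. It assumes log-mel spectrogram input and an eight-class emotion head (matching the RAVDESS emotion set); every layer size, name, and hyperparameter is a hypothetical stand-in, not the author's actual design.

    # Hypothetical CNN + Transformer-encoder SER classifier (illustrative only).
    import torch
    import torch.nn as nn

    class SERTransformer(nn.Module):
        def __init__(self, n_mels=64, d_model=128, n_heads=4,
                     n_layers=2, n_emotions=8):
            super().__init__()
            # CNN front end: downsample the spectrogram in time and frequency
            # and project it into the Transformer's embedding dimension.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(32, d_model, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
            )
            # Collapse the frequency axis so each time step becomes one token.
            self.freq_pool = nn.AdaptiveAvgPool2d((1, None))
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
            self.classifier = nn.Linear(d_model, n_emotions)

        def forward(self, spec):              # spec: (batch, 1, n_mels, time)
            x = self.cnn(spec)                # (batch, d_model, n_mels/4, time/4)
            x = self.freq_pool(x)             # (batch, d_model, 1, time/4)
            x = x.squeeze(2).transpose(1, 2)  # (batch, time/4, d_model)
            x = self.encoder(x)               # contextualize the frame tokens
            return self.classifier(x.mean(dim=1))  # mean-pooled utterance logits

    model = SERTransformer()
    logits = model(torch.randn(2, 1, 64, 300))  # two log-mel spectrogram clips
    print(logits.shape)                         # torch.Size([2, 8])

For the multilabel setting the abstract mentions, the per-utterance softmax/cross-entropy objective would be replaced with independent per-class sigmoid outputs trained with a binary cross-entropy loss.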
dc.language.iso en en_US
dc.publisher Nazarbayev University School of Engineering and Digital Sciences en_US
dc.rights Attribution-NoDerivs 3.0 United States
dc.rights.uri http://creativecommons.org/licenses/by-nd/3.0/us/
dc.subject SER en_US
dc.subject Deep Learning en_US
dc.subject Transformer en_US
dc.subject Type of access: Restricted en_US
dc.title DEEP TRANSFORMER NEURAL NETWORKS FOR REAL-TIME SPEECH EMOTION RECOGNITION: A MULTIMODAL APPROACH en_US
dc.type Master's thesis en_US
workflow.import.source science


Except where otherwise noted, this item's license is described as Attribution-NoDerivs 3.0 United States.