ENHANCED MULTIMODAL EMOTION RECOGNITION SYSTEM WITH DEEP LEARNING AND HYBRID FUSION
Publisher
Nazarbayev University School of Engineering and Digital Sciences
Abstract
This paper presents a multimodal emotion recognition (MER) system that combines deep learning and hybrid transformer fusion to analyze emotional expressions from raw user-uploaded videos. The system processes and fuses information from three modalities (video, audio, and text) to identify six primary emotions: happiness, sadness, anger, frustration, excitement, and neutral. Leveraging transformers and pre-trained encoders such as RoBERTa and VGG, our architecture captures intra- and inter-modal dependencies to improve classification performance. The MER model was integrated into a user-friendly web application, enabling real-time emotion inference via a React-based frontend and a Flask backend. Evaluated on the IEMOCAP dataset, the system achieved a validation weighted F1-score of 0.5128 and showed strong recognition of expressive emotions such as sadness and anger. This work demonstrates the potential of multimodal deep learning approaches in emotion-aware systems and offers a scalable foundation for future applications in mental health monitoring, human-computer interaction, and affective computing.
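To illustrate the fusion idea described in the abstract, the following is a minimal PyTorch sketch of a hybrid transformer fusion classifier over three pre-extracted modality embeddings (e.g. RoBERTa text features and VGG-style audio/video features). All module names, dimensions, and layer counts here are illustrative assumptions, not the authors' exact architecture.

    # Hedged sketch: hypothetical hybrid transformer fusion over three modalities.
    import torch
    import torch.nn as nn

    class HybridFusionClassifier(nn.Module):
        def __init__(self, text_dim=768, audio_dim=512, video_dim=512,
                     d_model=256, num_classes=6):
            super().__init__()
            # Project each modality's pre-computed embedding into a shared space.
            self.text_proj = nn.Linear(text_dim, d_model)
            self.audio_proj = nn.Linear(audio_dim, d_model)
            self.video_proj = nn.Linear(video_dim, d_model)
            # Transformer encoder attends across the three modality tokens,
            # modeling inter-modal dependencies.
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                               batch_first=True)
            self.fusion = nn.TransformerEncoder(layer, num_layers=2)
            self.classifier = nn.Linear(d_model, num_classes)

        def forward(self, text_emb, audio_emb, video_emb):
            # One token per modality: (batch, 3, d_model)
            tokens = torch.stack([self.text_proj(text_emb),
                                  self.audio_proj(audio_emb),
                                  self.video_proj(video_emb)], dim=1)
            fused = self.fusion(tokens)      # cross-modal attention
            pooled = fused.mean(dim=1)       # simple mean pooling over modalities
            return self.classifier(pooled)   # logits over the six emotions

    # Usage example with random placeholder features for a batch of 2 utterances.
    logits = HybridFusionClassifier()(torch.randn(2, 768),
                                      torch.randn(2, 512),
                                      torch.randn(2, 512))
    print(logits.shape)  # torch.Size([2, 6])

In a deployment like the one described, a Flask backend would extract the modality features from an uploaded video and pass them through such a model, returning the predicted emotion to the React frontend.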
Citation
Rymkan, A., Sarsengaliyev, D., Khamiyev, A., Baidussenov, A., & Mukhametzhanov, M. (2024). Enhanced multimodal emotion recognition system with deep learning and hybrid fusion. Nazarbayev University School of Engineering and Digital Sciences.
Creative Commons license
Except where otherwise noted, this item's license is described as Attribution 3.0 United States.
