MULTIMODAL EMOTION RECOGNITION WITH EEG, AUDIO, AND VIDEO USING TRANSFORMER ENCODER FOR INTERMEDIATE FUSION
Publisher
Nazarbayev University School of Engineering and Digital Sciences
Abstract
Multimodal Emotion Recognition (MER) increasingly relies on integrating diverse data sources such as audio, video, and electroencephalogram (EEG) signals. Despite these advances, effectively fusing the modalities remains a challenging problem. In this paper, we propose a novel intermediate fusion framework that uses custom convolutional neural networks (CNNs) tailored to each modality (audio, video, and EEG), combined with transformer-based fusion blocks employing multi-head attention mechanisms. Our approach integrates modality-specific features through intermediate fusion layers, allowing the model to better emphasize critical emotional cues. We benchmark our model on the EAV dataset against a recently proposed model that uses the same dataset, demonstrating that our intermediate fusion improves emotion recognition performance compared to unimodal and recent multimodal baselines.
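To make the architecture described above concrete, the following is a minimal sketch in PyTorch of intermediate fusion with per-modality CNN encoders and a transformer-encoder fusion block. The layer sizes, token counts, channel dimensions, and the 5-class output are illustrative assumptions, not the authors' exact configuration from the paper.

```python
# Hedged sketch of intermediate fusion for MER, assuming PyTorch.
# Input shapes, channel counts, and hyperparameters are assumptions.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """1D-CNN feature extractor for one modality (audio, video, or EEG).

    Each modality is assumed to arrive as (batch, channels, time); the
    encoder maps it to a short token sequence in a shared embedding
    space so tokens from all modalities can be concatenated for fusion.
    """

    def __init__(self, in_channels: int, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),  # 8 tokens per modality (assumption)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, embed_dim, 8) -> (batch, 8, embed_dim) token sequence
        return self.conv(x).transpose(1, 2)


class IntermediateFusionMER(nn.Module):
    """Concatenate per-modality tokens, fuse with a transformer encoder."""

    def __init__(self, num_classes: int = 5, embed_dim: int = 128):
        super().__init__()
        # Channel counts below are hypothetical placeholders.
        self.eeg_enc = ModalityEncoder(in_channels=30, embed_dim=embed_dim)
        self.audio_enc = ModalityEncoder(in_channels=1, embed_dim=embed_dim)
        self.video_enc = ModalityEncoder(in_channels=512, embed_dim=embed_dim)
        fusion_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, eeg, audio, video):
        # Intermediate fusion: modality-specific features are merged
        # mid-network, and multi-head self-attention mixes them.
        tokens = torch.cat(
            [self.eeg_enc(eeg), self.audio_enc(audio), self.video_enc(video)],
            dim=1,
        )  # (batch, 24, embed_dim)
        fused = self.fusion(tokens).mean(dim=1)  # pool fused tokens
        return self.head(fused)


if __name__ == "__main__":
    model = IntermediateFusionMER()
    logits = model(
        eeg=torch.randn(2, 30, 256),
        audio=torch.randn(2, 1, 1024),
        video=torch.randn(2, 512, 64),
    )
    print(logits.shape)  # torch.Size([2, 5])
```

The key design choice this sketch illustrates is that fusion happens at the feature level rather than at the input (early fusion) or decision (late fusion) stage: each modality keeps its own encoder, and cross-modal interaction is delegated to the attention layers of the fusion block.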
Citation
Chokushev, N., Darigulov, B., Akhmurzin, G., Turtkarayeva, A. (2025). Multimodal Emotion Recognition with EEG, Audio, and Video using Transformer Encoder for Intermediate Fusion. Nazarbayev University School of Engineering and Digital Sciences
Creative Commons license
Except where otherwise noted, this item's license is described as CC0 1.0 Universal
