MULTIMODAL EMOTION RECOGNITION WITH EEG, AUDIO, AND VIDEO USING TRANSFORMER ENCODER FOR INTERMEDIATE FUSION

Publisher

Nazarbayev University School of Engineering and Digital Sciences

Abstract

Multimodal Emotion Recognition (MER) has increasingly relied on integrating diverse data sources such as audio, video, and electroencephalogram (EEG). Despite these advances, fusing the modalities effectively remains a challenging problem. In this paper, we propose a novel intermediate fusion framework that combines custom convolutional neural networks (CNNs), tailored to each modality (audio, video, and EEG), with transformer-based fusion blocks built on multi-head attention. Our approach integrates modality-specific features through intermediate fusion layers, allowing the model to place greater emphasis on critical emotional cues. We benchmark our model on the EAV dataset against a recently proposed model that uses the same dataset, demonstrating that the proposed intermediate fusion improves emotion recognition performance over unimodal and recent baselines.
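As a concrete illustration of the architecture sketched in the abstract, the following minimal PyTorch example wires modality-specific CNN encoders into a transformer encoder that fuses one token per modality through multi-head attention. Every concrete choice here (layer sizes, channel counts, input shapes, and the five-class output head) is an assumption made for the sketch, not the configuration reported in the paper.

# A minimal sketch of the intermediate-fusion idea: modality-specific CNN
# encoders produce feature tokens that a transformer encoder with multi-head
# attention fuses before classification. All dimensions are illustrative
# assumptions, not the authors' actual configuration.
import torch
import torch.nn as nn

class ModalityCNN(nn.Module):
    """1-D CNN stub standing in for each modality-specific encoder."""
    def __init__(self, in_channels: int, d_model: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # collapse the time axis
        )
        self.proj = nn.Linear(64, d_model)    # project into the shared fusion space

    def forward(self, x):                     # x: (batch, channels, time)
        h = self.net(x).squeeze(-1)           # (batch, 64)
        return self.proj(h)                   # (batch, d_model)

class IntermediateFusionMER(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2, n_classes=5):
        super().__init__()
        # Hypothetical channel counts: 30 EEG channels, a mono audio
        # waveform, and 512-dim per-frame features for video.
        self.eeg_enc = ModalityCNN(30, d_model)
        self.audio_enc = ModalityCNN(1, d_model)
        self.video_enc = ModalityCNN(512, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, eeg, audio, video):
        # One token per modality; multi-head attention lets the fusion
        # block weight whichever modality carries the stronger emotional cue.
        tokens = torch.stack(
            [self.eeg_enc(eeg), self.audio_enc(audio), self.video_enc(video)],
            dim=1)                               # (batch, 3, d_model)
        fused = self.fusion(tokens).mean(dim=1)  # pool over modality tokens
        return self.head(fused)                  # (batch, n_classes) logits

# Example: one dummy batch through the network.
model = IntermediateFusionMER()
logits = model(torch.randn(2, 30, 256),    # EEG
               torch.randn(2, 1, 16000),   # audio
               torch.randn(2, 512, 32))    # video frame features
print(logits.shape)                        # torch.Size([2, 5])

Stacking one token per modality keeps the fusion step interpretable: each attention weight directly measures how much one modality attends to another, which is what lets intermediate fusion emphasize the modality carrying the strongest emotional signal for a given sample.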

Citation

Chokushev, N., Darigulov, B., Akhmurzin, G., & Turtkarayeva, A. (2025). Multimodal Emotion Recognition with EEG, Audio, and Video using Transformer Encoder for Intermediate Fusion. Nazarbayev University School of Engineering and Digital Sciences.

Creative Commons license

Except where otherwise noted, this item's license is described as CC0 1.0 Universal.