Abstract:
Interest in speech emotion recognition (SER), a particular case of the broader
problem of multimedia pattern recognition, is growing. SER is considered capable of
making communication between humans and artificial intelligence more efficient by
providing the machine with emotional context. The
field has developed rapidly with the recent emergence and growing accessibility of
deep learning techniques. These potential benefits and novel techniques have
drawn the attention of many specialists and generated a large number of
research papers that propose diverse, intricate methods. One such method involving
various data augmentation techniques has demonstrated high performance in this
field. This paper analyzes several simple augmentation methods in an attempt to
improve existing models. In particular, this research focuses on state-of-the-art
CNN models for the RAVDESS, EMO-DB, and IEMOCAP datasets, and exploits
temporal, spatial, and spectral transformations of sound as an underlying method for
augmentation. By applying simple augmentations, we achieved an increase in
performance for the IEMOCAP model and, for the other datasets, positive effects
comparable to those of more complex augmentation methods.
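
To make the notion of a simple augmentation concrete, the following is a minimal sketch of three such transformations. It is an illustration under stated assumptions, not the pipeline evaluated in this paper: the abstract does not list the exact transformations, so the function names, parameters, and default values below are hypothetical. It assumes a waveform stored as a 1-D NumPy array and a spectrogram stored as a 2-D array of shape (frequency bins, frames).

```python
import numpy as np


def time_shift(signal, max_fraction=0.1, rng=None):
    """Temporal transformation: circularly shift the waveform by a random offset.

    max_fraction is a hypothetical parameter bounding the shift relative to length.
    """
    rng = np.random.default_rng() if rng is None else rng
    limit = int(len(signal) * max_fraction)
    offset = int(rng.integers(-limit, limit + 1))
    # np.roll wraps samples around, so the length and content are preserved.
    return np.roll(signal, offset)


def add_noise(signal, snr_db=20.0, rng=None):
    """Add white Gaussian noise at an assumed target signal-to-noise ratio (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def mask_frequency_band(spectrogram, max_width=8, rng=None):
    """Spectral transformation: zero out a random band of adjacent frequency bins.

    The spectrogram is treated as an image of shape (n_freq, n_frames).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_freq = spectrogram.shape[0]
    width = min(int(rng.integers(0, max_width + 1)), n_freq)
    start = int(rng.integers(0, n_freq - width + 1))
    masked = spectrogram.copy()  # leave the input array untouched
    masked[start:start + width, :] = 0.0
    return masked
```

In a training loop, such functions would typically be applied at random to each training sample while the validation and test data are left unchanged.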