SPEAKINGFACES: A LARGE-SCALE MULTIMODAL DATASET OF VOICE COMMANDS WITH VISUAL AND THERMAL VIDEO STREAMS

Abdrakhmanova, Madina; Kuzdeuov, Askat; Jarju, Sheikh; Khassanov, Yerbolat; Lewis, Michael; Varol, Huseyin Atakan

dc.contributor.author	Abdrakhmanova, Madina
dc.contributor.author	Kuzdeuov, Askat
dc.contributor.author	Jarju, Sheikh
dc.contributor.author	Khassanov, Yerbolat
dc.contributor.author	Lewis, Michael
dc.contributor.author	Varol, Huseyin Atakan
dc.date.accessioned	2021-08-27T10:23:26Z
dc.date.available	2021-08-27T10:23:26Z
dc.date.issued	2021-05-16
dc.identifier.citation	Abdrakhmanova, M., Kuzdeuov, A., Jarju, S., Khassanov, Y., Lewis, M., & Varol, H. A. (2021). SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams. Sensors, 21(10), 3465. https://doi.org/10.3390/s21103465	en_US
dc.identifier.issn	1424-8220
dc.identifier.uri	https://www.mdpi.com/1424-8220/21/10/3465
dc.identifier.uri	https://doi.org/10.3390/s21103465
dc.identifier.uri	http://nur.nu.edu.kz/handle/123456789/5731
dc.description.abstract	We present SpeakingFaces as a publicly available large-scale multimodal dataset developed to support machine learning research in contexts that utilize a combination of thermal, visual, and audio data streams; examples include human–computer interaction, biometric authentication, recognition systems, domain transfer, and speech recognition. SpeakingFaces comprises aligned high-resolution thermal and visual spectra image streams of fully framed faces synchronized with audio recordings of each subject speaking approximately 100 imperative phrases. Data were collected from 142 subjects, yielding over 13,000 instances of synchronized data (∼3.8 TB). For technical validation, we demonstrate two baseline examples. The first baseline shows classification by gender, utilizing different combinations of the three data streams in both clean and noisy environments. The second example consists of thermal-to-visual facial image translation, as an instance of domain transfer.	en_US
dc.language.iso	en	en_US
dc.publisher	MDPI AG	en_US
dc.relation.ispartofseries	Sensors;2021, 21(10), 3465; https://doi.org/10.3390/s21103465
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/us/	*
dc.subject	Computer vision	en_US
dc.subject	Datasets	en_US
dc.subject	Domain transfer	en_US
dc.subject	Human–computer interaction	en_US
dc.subject	Multimodal learning	en_US
dc.subject	Thermal imaging	en_US
dc.subject	Type of access: Open Access	en_US
dc.title	SPEAKINGFACES: A LARGE-SCALE MULTIMODAL DATASET OF VOICE COMMANDS WITH VISUAL AND THERMAL VIDEO STREAMS	en_US
dc.type	Article	en_US
workflow.import.source	science