AUDIO-VISUAL SPEECH RECOGNITION USING VISUAL AND THERMAL IMAGES

dc.contributor.author: Koishybayeva, Zhaniya
dc.date.accessioned: 2021-08-10T02:46:22Z
dc.date.available: 2021-08-10T02:46:22Z
dc.date.issued: 2021-08
dc.description.abstract: In this thesis I examine the hypothesis that the performance of lipreading systems can be improved by combining thermal image data with the usual visual image streams. I test the hypothesis by constructing a deep learning lipreading system based on the Lip2Wav model. The system takes silent video as input and generates synthesized audio as output. System performance is evaluated using standard metrics: the Word Recognition Rate (WRR), to assess the contribution of the thermal input to the accuracy of the lipreading system, and assessments of the quality of the synthesized audio, namely Short-Term Objective Intelligibility (STOI), Extended STOI (ESTOI), and Perceptual Evaluation of Speech Quality (PESQ). The model is trained using three variations of input channels: visual images only, thermal images only, and a combination of the visual and thermal images. The model uses a novel dataset, SpeakingFaces LipReading (SFLR), composed of aligned streams of visual and thermal images of a person reading short imperative commands that are representative of typical human-computer interaction with devices such as personal digital assistants. The results, shown in Table 5.2, suggest that with the inclusion of aligned thermal data I was able to approximate the system performance reported in previously published results. However, the addition of the thermal image stream did not improve performance. [en_US]
dc.identifier.citation: Koishybayeva, Z. (2021). Audio-Visual Speech Recognition Using Visual and Thermal Images (Unpublished master's thesis). Nazarbayev University, Nur-Sultan, Kazakhstan [en_US]
dc.identifier.uri: http://nur.nu.edu.kz/handle/123456789/5673
dc.language.iso: en [en_US]
dc.publisher: Nazarbayev University School of Engineering and Digital Sciences [en_US]
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 United States [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/ [*]
dc.subject: Type of access: Open Access [en_US]
dc.subject: Speech Recognition [en_US]
dc.subject: Thermal Images [en_US]
dc.subject: Visual Images [en_US]
dc.subject: Research Subject Categories::TECHNOLOGY [en_US]
dc.subject: Word Recognition Rate [en_US]
dc.subject: WRR [en_US]
dc.title: AUDIO-VISUAL SPEECH RECOGNITION USING VISUAL AND THERMAL IMAGES [en_US]
dc.type: Master's thesis [en_US]
workflow.import.source: science

Files

Original bundle
Name: Thesis - Zhaniya Koishybayeva.pdf
Size: 5.63 MB
Format: Adobe Portable Document Format
Description: Thesis
Name: Presentation - Zhaniya Koishybayeva.pdf
Size: 1.26 MB
Format: Adobe Portable Document Format
Description: Presentation
License bundle
Name: license.txt
Size: 6.28 KB
Format: Item-specific license agreed upon to submission