AUDIO-VISUAL SPEECH RECOGNITION USING VISUAL AND THERMAL IMAGES
Date
2021-08
Authors
Koishybayeva, Zhaniya
Publisher
Nazarbayev University School of Engineering and Digital Sciences
Abstract
In this thesis I examine the hypothesis that the performance of lipreading systems can be improved by supplementing the usual visual image streams with thermal image data. I test the hypothesis by constructing a deep learning system based on the Lip2Wav model for lipreading. The system takes silent video as input and generates synthesized audio as output. System performance is evaluated using the Word Recognition Rate (WRR), which assesses the contribution of the thermal input to lipreading accuracy, and using objective measures of the intelligibility and quality of the synthesized audio: Short-Time Objective Intelligibility (STOI), Extended STOI (ESTOI), and Perceptual Evaluation of Speech Quality (PESQ). The model is trained on three variations of the input channels: visual images only, thermal images only, and a combination of the visual and thermal images. Training uses a novel dataset, SpeakingFaces LipReading (SFLR), comprised of aligned streams of visual and thermal images of a person reading short imperative commands representative of typical human-computer interaction with devices such as personal digital assistants. The results, shown in Table 5.2, suggest that with the inclusion of aligned thermal data the system approximates the performance of previously published results. However, the addition of the thermal image stream did not improve performance.
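The abstract describes feeding aligned visual and thermal image streams to a single lipreading model. As a minimal illustrative sketch (not the thesis code), the snippet below assumes the two modalities are fused by concatenating channels of an aligned mouth-region crop; the array shapes and the channel-concatenation strategy are assumptions for illustration only.

```python
# Hedged sketch: fuse an aligned RGB frame and thermal frame into one
# multi-channel input array. Shapes and fusion strategy are assumed, not
# taken from the thesis implementation.
import numpy as np

def fuse_frames(visual_frame: np.ndarray, thermal_frame: np.ndarray) -> np.ndarray:
    """Concatenate an RGB frame (H, W, 3) with an aligned thermal frame
    (H, W) or (H, W, 1) along the channel axis, yielding an (H, W, 4) input."""
    if thermal_frame.ndim == 2:                      # allow single-channel thermal maps
        thermal_frame = thermal_frame[..., np.newaxis]
    return np.concatenate([visual_frame, thermal_frame], axis=-1)

# Example with dummy data: a 96x96 mouth crop (sizes are hypothetical)
rgb = np.random.rand(96, 96, 3).astype(np.float32)
thermal = np.random.rand(96, 96).astype(np.float32)
fused = fuse_frames(rgb, thermal)
print(fused.shape)  # (96, 96, 4)
```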
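The synthesized-audio metrics named in the abstract (STOI, ESTOI, PESQ) can be computed with the third-party `pystoi` and `pesq` Python packages. The sketch below shows one way to score a reference/synthesized waveform pair; it is an assumed evaluation setup, not necessarily the tooling used in the thesis.

```python
# Hedged sketch: compute STOI, ESTOI, and PESQ for a pair of mono waveforms.
# Uses the `pystoi` and `pesq` packages; sampling rate of 16 kHz is assumed.
import numpy as np
from pystoi import stoi
from pesq import pesq

def speech_quality_scores(reference: np.ndarray, synthesized: np.ndarray, fs: int = 16000):
    """Return (STOI, ESTOI, PESQ) scores for two mono waveforms sampled at fs."""
    stoi_score = stoi(reference, synthesized, fs, extended=False)
    estoi_score = stoi(reference, synthesized, fs, extended=True)
    pesq_score = pesq(fs, reference, synthesized, 'wb')  # wideband PESQ at 16 kHz
    return stoi_score, estoi_score, pesq_score
```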
Keywords
Type of access: Open Access, Speech Recognition, Thermal Images, Visual Images, Research Subject Categories::TECHNOLOGY, Word Recognition Rate, WRR
Citation
Koishybayeva, Z. (2021). Audio-Visual Speech Recognition Using Visual and Thermal Images (Unpublished master's thesis). Nazarbayev University, Nur-Sultan, Kazakhstan.