Abstract:
Naturalness and fluency are the most important qualities expected from text-to-speech synthesis.
This project introduces a multilingual text-to-speech (TTS) engine capable of
producing high-quality speech in English, Kazakh, and Russian.
The main idea is to address a limitation of existing TTS systems, which typically offer
one voice in one language: our engine supports all three languages at the same time.
A text-to-speech synthesis system usually consists of several stages: a text analysis
front end, an acoustic model, and a sound synthesis module. For synthesis, we
use Tacotron, an end-to-end generative text-to-speech model that synthesizes speech
directly from symbols.
We also describe a high-quality speech dataset for Kazakh, Russian, and English.
The dataset contains 40 hours of transcribed audio recordings per language,
spoken by a professional female speaker. This publicly available large-scale speech synthesis
dataset was developed to promote multilingual text-to-speech (TTS) applications in academia
and industry. This paper outlines our experience by describing the dataset development
procedures, the challenges we faced, and important future directions. To
evaluate the resulting system, we conducted subjective assessment tests based on the
Likert scale.
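
As an illustration of how responses from a Likert-based listening test might be summarized, the following Python sketch computes a mean rating and an approximate 95% confidence interval. The 5-point scale, the function name summarize_likert, and the sample ratings are assumptions made for this example only; they are not the protocol or the results of the evaluation described above.

```python
import math
import statistics


def summarize_likert(ratings, scale_min=1, scale_max=5):
    """Return (mean, half_width) for a list of Likert ratings.

    half_width is the half-width of a normal-approximation 95% confidence
    interval. The 1..5 scale bounds are an assumption for this sketch.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    for r in ratings:
        if not scale_min <= r <= scale_max:
            raise ValueError(f"rating {r} is outside {scale_min}..{scale_max}")
    mean = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    # 1.96 is the normal-approximation multiplier for a 95% interval.
    return mean, 1.96 * sem


if __name__ == "__main__":
    # Hypothetical listener ratings, for illustration only.
    hypothetical = {
        "English": [4, 5, 4, 3, 5, 4],
        "Kazakh": [5, 4, 4, 4, 5, 3],
        "Russian": [4, 4, 5, 3, 4, 4],
    }
    for language, ratings in hypothetical.items():
        mean, half_width = summarize_likert(ratings)
        print(f"{language}: {mean:.2f} +/- {half_width:.2f}")
```

With few listeners, the normal approximation is rough; a t-distribution multiplier would give a wider, more conservative interval.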