SHALA KAZAKH: A MIXED TRANSCRIPTION OF KAZAKH AND RUSSIAN
Publisher
Nazarbayev University School of Engineering and Digital Sciences
Abstract
This thesis addresses the challenge of Automatic Speech Recognition (ASR) for "Shala Kazakh", a widespread code-switching phenomenon in Kazakhstan in which Kazakh speakers integrate Russian words and expressions into their speech. While ASR systems are improving rapidly for high-resource languages, they struggle with code-switched speech, especially in low-resource languages such as Kazakh. This research introduces a novel approach: a state-of-the-art (SOTA) Whisper model is trained on monolingual Kazakh and Russian datasets and additionally on a newly collected 52-hour code-switching dataset of bilingual speech gathered from TikTok. The experimental results demonstrate that incorporating a Russian dataset significantly improves transcription in the code-switching scenario. This work provides a framework for developing robust ASR systems for other low-resource languages with similar code-switching patterns, contributing to both technological advancement and language preservation.
Keywords
Automatic Speech Recognition (ASR), code-switching, Kazakh language, HUMANITIES and RELIGION::Languages and linguistics::Slavic languages::Russian language, Shala Kazakh, Whisper, fine-tuning, low-resource language, dataset, speech corpus, transfer learning, Word Error Rate (WER), Character Error Rate (CER), language model, type of access: embargo
Citation
Mukhamejan, T. (2025). Shala Kazakh: A Mixed Transcription of Kazakh and Russian. Nazarbayev University School of Engineering and Digital Sciences.