SHALA KAZAKH: A MIXED TRANSCRIPTION OF KAZAKH AND RUSSIAN

dc.contributor.author: Talap, Mukhamejan
dc.date.accessioned: 2025-06-03T07:04:10Z
dc.date.available: 2025-06-03T07:04:10Z
dc.date.issued: 2025-05-05
dc.description.abstract: This thesis addresses the challenge of Automatic Speech Recognition (ASR) for "Shala Kazakh", a widespread code-switching phenomenon in Kazakhstan in which Kazakh speakers integrate Russian words and expressions into their speech. While ASR systems are rapidly improving for high-resource languages, they struggle to recognise code-switched speech, especially in low-resource languages such as Kazakh. This research introduces a novel approach: a state-of-the-art (SOTA) Whisper model is trained on monolingual Kazakh and Russian datasets and additionally on a newly collected 52-hour code-switching dataset of bilingual speech gathered from TikTok. The experimental results demonstrate that incorporating a Russian dataset significantly improves transcription in the code-switching scenario. This work provides a framework for developing robust ASR systems for other low-resource languages that, like Kazakh, exhibit similar code-switching, contributing to both technological advancement and language preservation.
dc.identifier.citation: Mukhamejan, T. (2025). Shala Kazakh: A Mixed Transcription of Kazakh and Russian. Nazarbayev University School of Engineering and Digital Sciences.
dc.identifier.uri: https://nur.nu.edu.kz/handle/123456789/8711
dc.language.iso: en
dc.publisher: Nazarbayev University School of Engineering and Digital Sciences
dc.subject: Automatic Speech Recognition (ASR)
dc.subject: code-switching
dc.subject: Kazakh language
dc.subject: HUMANITIES and RELIGION::Languages and linguistics::Slavic languages::Russian language
dc.subject: Shala Kazakh
dc.subject: Whisper
dc.subject: fine-tuning
dc.subject: low-resource language
dc.subject: dataset
dc.subject: speech corpus
dc.subject: transfer learning
dc.subject: Word Error Rate (WER)
dc.subject: Character Error Rate (CER)
dc.subject: Language model
dc.subject: type of access: embargo
dc.title: SHALA KAZAKH: A MIXED TRANSCRIPTION OF KAZAKH AND RUSSIAN
dc.type: Master's thesis

Files

Original bundle

Name: Shala Kazakh A Mixed Transcription of Kazakh and Russian.pdf
Size: 566.59 KB
Format: Adobe Portable Document Format
Description: Master's Thesis
Access status: Embargo until 2026-05-31