SHALA KAZAKH: A MIXED TRANSCRIPTION OF KAZAKH AND RUSSIAN
| dc.contributor.author | Talap, Mukhamejan | |
| dc.date.accessioned | 2025-06-03T07:04:10Z | |
| dc.date.available | 2025-06-03T07:04:10Z | |
| dc.date.issued | 2025-05-05 | |
| dc.description.abstract | This thesis addresses the challenge of Automatic Speech Recognition (ASR) for "Shala Kazakh", a widespread code-switching phenomenon in Kazakhstan in which Kazakh speakers integrate Russian words and expressions into their speech. While ASR systems are rapidly improving for high-resource languages, they struggle with code-switched speech, especially in low-resource languages such as Kazakh. This research introduces a novel approach: training a state-of-the-art (SOTA) Whisper model on both monolingual Kazakh and Russian datasets, and additionally on a newly collected 52-hour code-switching dataset of bilingual speech gathered from TikTok. The experimental results demonstrate that incorporating a Russian dataset significantly improves transcription in the code-switching scenario. This work provides a framework for developing robust ASR systems for other low-resource languages that, like Kazakh, exhibit similar code-switching, contributing to both technological advancement and language preservation. | |
| dc.identifier.citation | Talap, M. (2025). Shala Kazakh: A Mixed Transcription of Kazakh and Russian. Nazarbayev University School of Engineering and Digital Sciences. | |
| dc.identifier.uri | https://nur.nu.edu.kz/handle/123456789/8711 | |
| dc.language.iso | en | |
| dc.publisher | Nazarbayev University School of Engineering and Digital Sciences | |
| dc.subject | Automatic Speech Recognition (ASR) | |
| dc.subject | code-switching | |
| dc.subject | Kazakh language | |
| dc.subject | HUMANITIES and RELIGION::Languages and linguistics::Slavic languages::Russian language | |
| dc.subject | Shala Kazakh | |
| dc.subject | Whisper | |
| dc.subject | fine-tuning | |
| dc.subject | low-resource language | |
| dc.subject | dataset | |
| dc.subject | speech corpus | |
| dc.subject | transfer learning | |
| dc.subject | Word Error Rate (WER) | |
| dc.subject | Character Error Rate (CER) | |
| dc.subject | Language model | |
| dc.subject | type of access: embargo | |
| dc.title | SHALA KAZAKH: A MIXED TRANSCRIPTION OF KAZAKH AND RUSSIAN | |
| dc.type | Master's thesis | |
Files
Original bundle
- Name: Shala Kazakh A Mixed Transcription of Kazakh and Russian.pdf
- Size: 566.59 KB
- Format: Adobe Portable Document Format
- Description: Master's Thesis