AUTOMATION OF CORPUS BUILDING FROM TRANSCRIPTION TO SEARCH: A CASE STUDY OF KAZAKH
| dc.contributor.author | Mikhailov, Nikolay | |
| dc.date.accessioned | 2025-05-08T12:18:41Z | |
| dc.date.available | 2025-05-08T12:18:41Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | This thesis examines the challenges of corpus building for low-resource languages by developing an automated workflow for the Multimedia Corpus of Spoken Kazakh Language (MCSKL). Taking Kazakh as a case study, it examines strategies for making the corpus-building process more efficient through automation. The first challenge is adapting tools initially designed for high-resource languages, such as English, to low-resource ones, like Kazakh, which was achieved via Whisper speech-to-text (STT). However, since transcription is only an early stage, the STT component had to be part of a larger workflow. Thus, the second challenge was to create a workflow that simplifies the time-consuming steps of corpus building, ultimately converting language data from audio into a searchable database. The thesis also takes a closer look at MCSKL, a corpus of naturally occurring spoken interactional Kazakh, which served as both the data source and the case study for assessing the feasibility of the automation effort. The study argues that adapting tools designed initially for high-resource languages, such as Whisper for speech-to-text and ELAN for annotation, requires not only technical fine-tuning but also methodological adjustments. In response, this thesis proposes a scalable and modular workflow that integrates these tools and supplements them with custom Python scripts, enabling researchers to efficiently process naturally occurring, unprompted spoken language data and output it in searchable formats compatible with search engines such as Apache Solr. Methodologically, the thesis adopts a discourse-functional approach to annotation, using intonation units rather than sentence-based segmentation to represent spoken interaction more faithfully. The research draws on both theoretical insights from corpus linguistics and practical implementation within a multilingual, interdisciplinary research collaboration. It also highlights the broader implications of automating linguistic documentation for underrepresented languages in the digital realm. The findings demonstrate that while full automation remains elusive, targeted computational support for the time-intensive phases can significantly reduce annotation time, improve consistency, and empower resource-limited teams. Ultimately, this thesis contributes to sustainable corpus-building practices and reinforces the role of corpus linguistics as a discipline central to both linguistic research and the development of language technology. | |
| dc.identifier.citation | Mikhailov, N. (2025). Automation of corpus building from transcription to search: a case study of Kazakh. Nazarbayev University School of Sciences and Humanities. | |
| dc.identifier.uri | https://nur.nu.edu.kz/handle/123456789/8433 | |
| dc.language.iso | en | |
| dc.publisher | Nazarbayev University School of Sciences and Humanities | |
| dc.rights | Attribution-NonCommercial-ShareAlike 3.0 United States | en |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/3.0/us/ | |
| dc.subject | Type of access: Embargo | |
| dc.subject | HUMANITIES and RELIGION::Languages and linguistics::Linguistic subjects::Computational linguistics | |
| dc.subject | Corpus linguistics | |
| dc.subject | Low-resource language | |
| dc.subject | Kazakh language | |
| dc.subject | Speech recognition | |
| dc.subject | Natural language processing | |
| dc.subject | Multimedia Corpus of Spoken Kazakh Language | |
| dc.subject | MCSKL | |
| dc.title | AUTOMATION OF CORPUS BUILDING FROM TRANSCRIPTION TO SEARCH: A CASE STUDY OF KAZAKH | |
| dc.type | PhD thesis | |
Files
Original bundle
- Name: Thesis repository copy.pdf
- Size: 1.47 MB
- Format: Adobe Portable Document Format
- Description: PhD thesis