AUTOMATION OF CORPUS BUILDING FROM TRANSCRIPTION TO SEARCH: A CASE STUDY OF KAZAKH

dc.contributor.author: Mikhailov, Nikolay
dc.date.accessioned: 2025-05-08T12:18:41Z
dc.date.available: 2025-05-08T12:18:41Z
dc.date.issued: 2025
dc.description.abstract: This thesis examines the challenges of corpus building for low-resource languages by developing an automated workflow for the Multimedia Corpus of Spoken Kazakh Language (MCSKL). Taking Kazakh as a case study, it explores strategies for making the corpus-building process more efficient through automation. The first challenge was adapting tools initially designed for high-resource languages, such as English, to low-resource ones, like Kazakh; this was achieved via Whisper STT. However, since transcription is only an early stage of corpus building, the STT had to be part of a more complex workflow. Thus, the second challenge was to create a workflow that simplifies the time-consuming steps of corpus building, ultimately converting language data from audio into a searchable database. The thesis also takes a closer look at MCSKL, a corpus of naturally occurring spoken interactional Kazakh, which served as both the data source and the case study for assessing the feasibility of the automation effort. The study argues that adapting tools initially designed for high-resource languages, such as Whisper for speech-to-text and ELAN for annotation, requires not only technical fine-tuning but also methodological adjustments. In response, this thesis proposes a scalable and modular workflow that integrates these tools and supplements them with custom Python scripts, enabling researchers to efficiently process naturally occurring, unprompted spoken language data and output it in searchable formats compatible with search engines such as Apache Solr. Methodologically, the thesis adopts a discourse-functional approach to annotation, using intonation units rather than sentence-based segmentation to represent spoken interaction more faithfully. The research draws on both theoretical insights from corpus linguistics and practical implementation within a multilingual, interdisciplinary research collaboration. It also highlights the broader implications of automating linguistic documentation for underrepresented languages in the digital realm. The findings demonstrate that while full automation remains elusive, targeted computational support for the time-intensive phases can significantly reduce annotation time, improve consistency, and empower resource-limited teams. Ultimately, this thesis contributes to sustainable corpus-building practices and reinforces the role of corpus linguistics as a discipline central to both linguistic research and the development of language technology.
dc.identifier.citation: Mikhailov, N. (2025). Automation of corpus building from transcription to search: a case study of Kazakh. Nazarbayev University School of Sciences and Humanities.
dc.identifier.uri: https://nur.nu.edu.kz/handle/123456789/8433
dc.language.iso: en
dc.publisher: Nazarbayev University School of Sciences and Humanities
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/
dc.subject: Type of access: Embargo
dc.subject: HUMANITIES and RELIGION::Languages and linguistics::Linguistic subjects::Computational linguistics
dc.subject: Corpus linguistics
dc.subject: Low-resource language
dc.subject: Kazakh language
dc.subject: Speech recognition
dc.subject: Natural language processing
dc.subject: Multimedia Corpus of Spoken Kazakh Language
dc.subject: MCSKL
dc.title: AUTOMATION OF CORPUS BUILDING FROM TRANSCRIPTION TO SEARCH: A CASE STUDY OF KAZAKH
dc.type: PhD thesis

Files

Original bundle

Name: Thesis repository copy.pdf
Size: 1.47 MB
Format: Adobe Portable Document Format
Description: PhD thesis
Access status: Embargo until 2027-01-01