AUTOMATION OF CORPUS BUILDING FROM TRANSCRIPTION TO SEARCH: A CASE STUDY OF KAZAKH

Mikhailov, Nikolay

dc.contributor.author	Mikhailov, Nikolay
dc.date.accessioned	2025-05-08T12:18:41Z
dc.date.available	2025-05-08T12:18:41Z
dc.date.issued	2025
dc.description.abstract	This thesis examines the challenges of corpus building for low-resource languages by developing an automated workflow for the Multimedia Corpus of Spoken Kazakh Language (MCSKL). Using Kazakh as a case study, it explores how automation can make the corpus-building process more efficient. The first challenge was to adapt tools originally designed for high-resource languages, such as English, to low-resource ones, like Kazakh; this was achieved with Whisper speech-to-text (STT). Because transcription is only an early stage of corpus building, the STT component had to become part of a more complex workflow. The second challenge was therefore to create a workflow that simplifies the time-consuming steps of corpus building, with the ultimate aim of converting language data from audio into a searchable database. The thesis also takes a closer look at MCSKL, a corpus of naturally occurring spoken interactional Kazakh, which served both as the data source and as the test case for the feasibility of the automation effort. The study argues that adapting tools originally designed for high-resource languages, such as Whisper for speech-to-text and ELAN for annotation, requires not only technical fine-tuning but also methodological adjustments. In response, the thesis proposes a scalable and modular workflow that integrates these tools and supplements them with custom Python scripts, enabling researchers to process naturally occurring, unprompted spoken language data efficiently and to output it in searchable formats compatible with search engines such as Apache Solr. Methodologically, the thesis adopts a discourse-functional approach to annotation, using intonation units rather than sentence-based segmentation to represent spoken interaction more faithfully. The research draws on both theoretical insights from corpus linguistics and practical implementation within a multilingual, interdisciplinary research collaboration.
It also highlights the broader implications of automating linguistic documentation for languages that are underrepresented in the digital realm. The findings demonstrate that, while full automation remains elusive, targeted computational support for the most time-intensive phases can significantly reduce annotation time, improve consistency, and empower resource-limited teams. Ultimately, this thesis contributes to sustainable corpus-building practices and reinforces the role of corpus linguistics as a discipline central to both linguistic research and the development of language technology.
dc.identifier.citation	Mikhailov, N. (2025). Automation of corpus building from transcription to search: a case study of Kazakh. Nazarbayev University School of Sciences and Humanities.
dc.identifier.uri	https://nur.nu.edu.kz/handle/123456789/8433
dc.language.iso	en
dc.publisher	Nazarbayev University School of Sciences and Humanities
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/us/
dc.subject	Type of access: Embargo
dc.subject	HUMANITIES and RELIGION::Languages and linguistics::Linguistic subjects::Computational linguistics
dc.subject	Corpus linguistics
dc.subject	Low-resource language
dc.subject	Kazakh language
dc.subject	Speech recognition
dc.subject	Natural language processing
dc.subject	Multimedia Corpus of Spoken Kazakh Language
dc.subject	MCSKL
dc.title	AUTOMATION OF CORPUS BUILDING FROM TRANSCRIPTION TO SEARCH: A CASE STUDY OF KAZAKH
dc.type	PhD thesis

Files

Original bundle

Name: Thesis repository copy.pdf
Size: 1.47 MB
Format: Adobe Portable Document Format
Description: PhD thesis

Embargo until 2027-01-01

Collections

01. PhD Thesis