OPTIMIZATION OF SMALL LANGUAGE MODEL FOR KAZAKH LANGUAGE

dc.contributor.authorMakulbekova, Ayazhan
dc.contributor.authorZhunisbayev, Murat
dc.contributor.authorTemirkhan, Yerkebulan
dc.contributor.authorAitkozhin, Aksultan
dc.contributor.authorJaparova, Fatima
dc.date.accessioned2025-06-13T06:53:39Z
dc.date.available2025-06-13T06:53:39Z
dc.date.issued2025
dc.description.abstractDeveloping effective natural language processing tools for Kazakh presents unique computational linguistic challenges. As an agglutinative Turkic language, Kazakh has a complex morphological structure, characterized by extensive suffixation and inflection, that requires specialized handling compared to Indo-European languages. Most existing NLP models struggle with Kazakh due to limited training data and inefficient tokenization approaches that fail to properly segment its long, morphologically rich words. Existing models also often rely on translated datasets, which fail to capture linguistic nuances and cultural context, resulting in poor performance on instruction-following tasks such as translation, tool use, and open-ended dialogue. This project addresses these challenges by developing and evaluating an optimized language model specifically designed for Kazakh's linguistic characteristics.

The project pursued four primary technical objectives:
1. Identify and implement the most effective tokenization strategy for handling Kazakh's agglutinative morphology.
2. Build a Kazakh NLP model with instruction-following capabilities for diverse Kazakh-language tasks, including translation, tool use, and conversational applications.
3. Optimize for computational efficiency using:
○ Parameter-efficient fine-tuning.
○ 4-bit quantization to reduce hardware demands.
4. Create a reproducible pipeline for low-resource language adaptation.

These objectives were designed to bridge the gap between theoretical language technology research and practical, deployable solutions for Kazakh language processing. Our approach focused on developing an efficient Kazakh language model by combining careful evaluation of tools and datasets with optimization techniques that allow for effective training in low-resource environments. The process consisted of the following main parts:
1. Tokenizer Evaluation and Selection: We started by comparing two tokenizers, BERT and Gemma, to see how well each handled Kazakh text (see the tokenizer comparison sketch after the abstract). The Gemma tokenizer demonstrated superior handling of Kazakh's Cyrillic script, significantly reducing unknown tokens compared to BERT. However, it was more computationally demanding, which we noted as an area for later optimization.
2. Model Selection: We chose the Gemma-3 model with 4 billion parameters as a solid middle ground between performance and hardware efficiency. To make it more lightweight, we applied 4-bit quantization using the BitsAndBytes library, which lowered memory usage without sacrificing too much accuracy. For fine-tuning, we used Low-Rank Adaptation (LoRA) through the Unsloth library, which let us fine-tune the model efficiently on a single A100 40GB GPU (see the quantization and LoRA sketch after the abstract).
3. Data Collection and Preparation: Since high-quality Kazakh data is limited, we combined a variety of sources to create a diverse training set. These included:
● The ner-kazakh dataset for named entity recognition tasks.
● A translated version of MMLU to cover general knowledge.
● Cultural datasets (Dastur, Constitutional Law).
● The Kazakh portion of the MURI-IT dataset and a machine-translated version of the Alpaca dataset for instruction tuning.
4. Training and Fine-Tuning: We employed parameter-efficient fine-tuning (PEFT) to minimize computational overhead, using LoRA adapters to update only critical weight matrices and thereby reduce the number of trainable parameters. We experimented with batch sizes and learning rates to find a stable and fast training configuration (see the training configuration sketch after the abstract).
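The tokenizer comparison in part 1 can be reproduced with a few lines of Python. The following is a minimal sketch using the Hugging Face transformers library; the checkpoint names and the Kazakh sample sentence are illustrative assumptions, since the record does not specify which BERT and Gemma checkpoints were compared.

```python
from transformers import AutoTokenizer

# Checkpoint names are assumptions for illustration; the report may
# have used different BERT and Gemma variants.
CHECKPOINTS = {
    "BERT": "bert-base-multilingual-cased",
    "Gemma": "google/gemma-3-4b-it",
}

# A short Kazakh sample containing long, heavily suffixed word forms.
sample = "Қазақстан Республикасының тәуелсіздігі баршамызға қымбат."

for name, ckpt in CHECKPOINTS.items():
    tok = AutoTokenizer.from_pretrained(ckpt)
    ids = tok.encode(sample, add_special_tokens=False)
    tokens = tok.convert_ids_to_tokens(ids)
    n_unk = sum(t == tok.unk_token for t in tokens)
    # Fewer tokens per word and fewer unknowns suggest better
    # segmentation of Kazakh Cyrillic morphology.
    print(f"{name}: {len(tokens)} tokens, {n_unk} unknown")
```

Counting tokens and unknown tokens in this way gives a rough proxy for how well each vocabulary segments Kazakh text, which is the comparison criterion described in part 1.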
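The setup in part 2 (4-bit quantization via BitsAndBytes plus LoRA adapters) can be sketched with the underlying transformers, bitsandbytes, and peft APIs. The project itself used the Unsloth wrapper; the checkpoint name, LoRA rank, and target modules below are assumptions for illustration, not the values from the report.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization via bitsandbytes, as described in part 2.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it",      # checkpoint name is an assumption
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections; rank, alpha, and target
# modules are illustrative, not the values tuned in the project.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights train
```

With a configuration like this, typically well under 1% of the 4-billion-parameter model's weights are trainable, which is what makes fine-tuning on a single A100 40GB GPU feasible.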
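For part 4, the batch-size and learning-rate experimentation amounts to varying a training configuration such as the one below. This is a hypothetical configuration using transformers' TrainingArguments; every value shown is an assumption (a common starting point for LoRA fine-tuning), not the tuned configuration from the report.

```python
from transformers import TrainingArguments

# Hypothetical PEFT training configuration; all values are assumptions.
training_args = TrainingArguments(
    output_dir="gemma3-kk-lora",     # illustrative output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    learning_rate=2e-4,              # a common LoRA starting point
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=3,
    bf16=True,                       # A100 GPUs support bfloat16
    logging_steps=10,
    save_strategy="epoch",
)
```

These arguments would then be passed to a trainer (for example, TRL's SFTTrainer) together with the quantized LoRA model from part 2 and the combined instruction dataset from part 3.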
dc.identifier.citationMakulbekova, A., Zhunisbayev, M., Temirkhan, Ye., Aitkozhin, A., Japarova, F. (2025). Optimization of small language model for Kazakh language. Nazarbayev University School of Engineering and Digital Sciences
dc.identifier.urihttps://nur.nu.edu.kz/handle/123456789/8939
dc.language.isoen
dc.publisherNazarbayev University School of Engineering and Digital Sciences
dc.rightsCC0 1.0 Universal
dc.rights.urihttp://creativecommons.org/publicdomain/zero/1.0/
dc.subjectnatural language processing (NLP)
dc.subjectKazakh language
dc.subjectagglutinative languages
dc.subjectTurkic languages
dc.subjectmorphological analysis
dc.subjectsuffixation and inflection
dc.subjecttokenization challenges
dc.subjectlow-resource languages
dc.subjectlanguage model optimization
dc.subjectinstruction-following tasks
dc.subjectcultural context in NLP
dc.subjecttranslation
dc.subjectmorphologically rich languages
dc.subjectHUMANITIES and RELIGION::Languages and linguistics::Linguistic subjects::Computational linguistics
dc.subjecttype of access: open access
dc.titleOPTIMIZATION OF SMALL LANGUAGE MODEL FOR KAZAKH LANGUAGE
dc.typeBachelor's Capstone project

Files

Original bundle

Name: Senior Project Final Report.pdf
Size: 6.97 MB
Format: Adobe Portable Document Format
Description: Bachelor's Capstone project