OPTIMIZATION OF SMALL LANGUAGE MODEL FOR KAZAKH LANGUAGE

dc.contributor.authorMakulbekova, Ayazhan
dc.contributor.authorZhunisbayev, Murat
dc.contributor.authorTemirkhan, Yerkebulan
dc.contributor.authorAitkozhin, Aksultan
dc.contributor.authorJaparova, Fatima
dc.date.accessioned2025-06-13T06:53:39Z
dc.date.available2025-06-13T06:53:39Z
dc.date.issued2025
dc.description.abstractDeveloping effective natural language processing tools for Kazakh presents unique computational linguistic challenges. As an agglutinative Turkic language, Kazakh has a complex morphological structure, characterized by extensive suffixation and inflection, that requires specialized handling compared to Indo-European languages. Most existing NLP models struggle with Kazakh due to limited training data and inefficient tokenization approaches that fail to properly segment its long, morphologically rich words. Existing models also often rely on translated datasets, which fail to capture linguistic nuances and cultural context, resulting in poor performance on instruction-following tasks such as translation, tool use, and open-ended dialogue. This project addresses these challenges by developing and evaluating an optimized language model specifically designed for Kazakh's linguistic characteristics.

The project pursued four primary technical objectives:
1. Identify and implement the most effective tokenization strategy for handling Kazakh's agglutinative morphology.
2. Build a Kazakh NLP model with instruction-following capabilities for diverse Kazakh-language tasks, including translation, tool use, and conversational applications.
3. Optimize for computational efficiency using:
○ Parameter-efficient fine-tuning.
○ 4-bit quantization to reduce hardware demands.
4. Create a reproducible pipeline for low-resource language adaptation.

These objectives were designed to bridge the gap between theoretical language technology research and practical, deployable solutions for Kazakh language processing. Our approach focused on developing an efficient Kazakh language model by combining careful evaluation of tools and datasets with optimization techniques that allow for effective training in low-resource environments. The process consisted of the following main parts:
1. Tokenizer Evaluation and Selection: We started by comparing two tokenizers, BERT and Gemma, to see how well each handled Kazakh text (see the tokenizer comparison sketch after the abstract). The Gemma tokenizer demonstrated superior handling of Kazakh's Cyrillic script, significantly reducing unknown tokens compared to BERT. However, it was more computationally demanding, which we noted as an area for later optimization.
2. Model Selection: We chose the Gemma-3 model with 4 billion parameters as a solid middle ground between performance and hardware efficiency. To make it more lightweight, we applied 4-bit quantization using the BitsAndBytes library, which lowered memory usage without sacrificing too much accuracy. For fine-tuning, we used Low-Rank Adaptation (LoRA) through the Unsloth library, which let us fine-tune the model efficiently on a single A100 40GB GPU (see the quantization and LoRA sketch after the abstract).
3. Data Collection and Preparation: Since high-quality Kazakh data is limited, we combined a variety of sources to create a diverse training set. These included:
● The ner-kazakh dataset for named entity recognition tasks.
● A translated version of MMLU to cover general knowledge.
● Cultural datasets (Dastur, Constitutional Law).
● The Kazakh portion of the MURI-IT dataset and a machine-translated version of the Alpaca dataset for instruction tuning.
4. Training and Fine-Tuning: We employed parameter-efficient fine-tuning (PEFT) to minimize computational overhead, using LoRA adapters to update only critical weight matrices and thereby reduce the number of trainable parameters. We experimented with batch sizes and learning rates to find a stable and fast training configuration (see the training configuration sketch after the abstract).
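The tokenizer comparison in part 1 can be reproduced with a few lines of Python. The following is a minimal sketch using the Hugging Face transformers library; the checkpoint names and the Kazakh sample sentence are illustrative assumptions, since the record does not specify which BERT and Gemma checkpoints were compared.

```python
from transformers import AutoTokenizer

# Checkpoint names are assumptions for illustration; the report may
# have used different BERT and Gemma variants.
CHECKPOINTS = {
    "BERT": "bert-base-multilingual-cased",
    "Gemma": "google/gemma-3-4b-it",
}

# A short Kazakh sample containing long, heavily suffixed word forms.
sample = "Қазақстан Республикасының тәуелсіздігі баршамызға қымбат."

for name, ckpt in CHECKPOINTS.items():
    tok = AutoTokenizer.from_pretrained(ckpt)
    ids = tok.encode(sample, add_special_tokens=False)
    tokens = tok.convert_ids_to_tokens(ids)
    n_unk = sum(t == tok.unk_token for t in tokens)
    # Fewer tokens per word and fewer unknowns suggest better
    # segmentation of Kazakh Cyrillic morphology.
    print(f"{name}: {len(tokens)} tokens, {n_unk} unknown")
```

Counting tokens and unknown tokens in this way gives a rough proxy for how well each vocabulary segments Kazakh text, which is the comparison criterion described in part 1.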
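The setup in part 2 (4-bit quantization via BitsAndBytes plus LoRA adapters) can be sketched with the underlying transformers, bitsandbytes, and peft APIs. The project itself used the Unsloth wrapper; the checkpoint name, LoRA rank, and target modules below are assumptions for illustration, not the values from the report.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization via bitsandbytes, as described in part 2.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it",      # checkpoint name is an assumption
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections; rank, alpha, and target
# modules are illustrative, not the values tuned in the project.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights train
```

With a configuration like this, typically well under 1% of the 4-billion-parameter model's weights are trainable, which is what makes fine-tuning on a single A100 40GB GPU feasible.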
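For part 4, the batch-size and learning-rate experimentation amounts to varying a training configuration such as the one below. This is a hypothetical configuration using transformers' TrainingArguments; every value shown is an assumption (a common starting point for LoRA fine-tuning), not the tuned configuration from the report.

```python
from transformers import TrainingArguments

# Hypothetical PEFT training configuration; all values are assumptions.
training_args = TrainingArguments(
    output_dir="gemma3-kk-lora",     # illustrative output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    learning_rate=2e-4,              # a common LoRA starting point
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=3,
    bf16=True,                       # A100 GPUs support bfloat16
    logging_steps=10,
    save_strategy="epoch",
)
```

These arguments would then be passed to a trainer (for example, TRL's SFTTrainer) together with the quantized LoRA model from part 2 and the combined instruction dataset from part 3.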
dc.identifier.citationMakulbekova, A., Zhunisbayev, M., Temirkhan, Ye., Aitkozhin, A., Japarova, F. (2025). Optimization of small language model for Kazakh language. Nazarbayev University School of Engineering and Digital Sciences
dc.identifier.urihttps://nur.nu.edu.kz/handle/123456789/8939
dc.language.isoen
dc.publisherNazarbayev University School of Engineering and Digital Sciences
dc.rightsCC0 1.0 Universal
dc.rights.urihttp://creativecommons.org/publicdomain/zero/1.0/
dc.subjectnatural language processing (NLP)
dc.subjectKazakh language
dc.subjectagglutinative languages
dc.subjectTurkic languages
dc.subjectmorphological analysis
dc.subjectsuffixation and inflection
dc.subjecttokenization challenges
dc.subjectlow-resource languages
dc.subjectlanguage model optimization
dc.subjectinstruction-following tasks
dc.subjectcultural context in NLP
dc.subjecttranslation
dc.subjectmorphologically rich languages
dc.subjectHUMANITIES and RELIGION::Languages and linguistics::Linguistic subjects::Computational linguistics
dc.subjecttype of access: open access
dc.titleOPTIMIZATION OF SMALL LANGUAGE MODEL FOR KAZAKH LANGUAGE
dc.typeBachelor's Capstone project

Files

Original bundle

Name: Senior Project Final Report.pdf
Size: 6.97 MB
Format: Adobe Portable Document Format
Description: Bachelor's Capstone project