Research Institute of Computer Science
http://nur.nu.edu.kz:80/handle/123456789/1121
2024-03-28T18:09:25Z

Named Entity Recognition for Kazakh Using Conditional Random Fields
http://nur.nu.edu.kz:80/handle/123456789/2234
2018-08-15T03:50:06Z
2016-01-01T00:00:00Z
Gulmira, Tolegen; Alymzhan, Toleu; Zheng, Xiaoqing
We addressed the Named Entity Recognition (NER) problem for the Kazakh language using conditional random fields (CRFs). Kazakh is a typical agglutinative language in which thousands of word forms can be generated by adding prefixes and suffixes to the same root, which gives rise to a serious data sparsity problem for many NLP tasks. To reduce data sparsity, a necessary preprocessing step is to split words into their roots and morphemes by morphological analysis. In this study, we designed a CRF-based NER system for Kazakh that leverages features derived from the output of a newly developed morphological analyzer, and found that introducing these derived features improves performance. Moreover, we assembled an NER corpus that was manually annotated with location, organization, and person names.
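
The idea of feeding morphological analysis into a CRF can be illustrated with a minimal sketch. It is not the authors' feature set: the `(root, suffixes)` analysis format, the feature names, and the hand-written segmentations below are all assumptions made for illustration. The per-token feature dictionaries it produces follow the general shape that common CRF toolkits accept.

```python
from typing import Dict, List, Tuple

# Hypothetical analyzer output: (root, list of suffix morphemes) for each token.
Analysis = Tuple[str, List[str]]


def token_features(tokens: List[str], analyses: List[Analysis], i: int) -> Dict[str, object]:
    """Build features for token i from its surface form and its morphological analysis."""
    root, suffixes = analyses[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word.lower": tokens[i].lower(),
        "word.istitle": tokens[i].istitle(),
        # Root-based features are shared by all inflected forms of the same
        # word, which is what reduces data sparsity.
        "root": root.lower(),
        "last_suffix": suffixes[-1] if suffixes else "<none>",
        "n_suffixes": len(suffixes),
    }
    if i > 0:
        feats["prev.root"] = analyses[i - 1][0].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next.root"] = analyses[i + 1][0].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens: List[str], analyses: List[Analysis]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in one sentence."""
    return [token_features(tokens, analyses, i) for i in range(len(tokens))]


# Toy input with hand-written (assumed) segmentations: two inflected forms of one root.
tokens = ["Алматыда", "Алматыға"]
analyses = [("Алматы", ["да"]), ("Алматы", ["ға"])]
features = sentence_features(tokens, analyses)
```

Both surface forms differ, yet they share the same `root` feature, so evidence gathered for one inflected form carries over to the other.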

Initial Experiments on Russian to Kazakh SMT
http://nur.nu.edu.kz:80/handle/123456789/2233
2018-08-15T03:50:04Z
2016-01-01T00:00:00Z
Myrzakhmetov, Bagdat; Makazhanov, Aibek
We present our initial experiments on Russian to Kazakh phrase-based statistical machine translation (SMT). Following a common approach to SMT between morphologically rich languages, we employ morphological processing techniques. Namely, for our initial experiments, we perform source-side lemmatization. Given the rather modest size of the parallel corpus at hand, we also put some effort into data cleaning and investigate the impact of the trade-off between data quality and quantity on overall performance. Although our experiments mostly focus on source-side preprocessing, we achieve a substantial, statistically significant improvement over the baseline that operates on raw, unprocessed data.
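
Source-side lemmatization can be sketched as a preprocessing pass over the parallel corpus before the translation model is trained. The snippet below is only an illustration under stated assumptions: the lemma dictionary, the function names, and the toy sentence pair are hypothetical, and a real system would call a morphological analyzer rather than a lookup table.

```python
from typing import Dict, List, Tuple


def lemmatize_source(sentence: str, lemma_of: Dict[str, str]) -> str:
    """Replace each whitespace-separated token with its lemma when one is known."""
    # Lookup is done on the lowercased token; unknown tokens are kept unchanged.
    return " ".join(lemma_of.get(tok.lower(), tok) for tok in sentence.split())


def preprocess_corpus(
    pairs: List[Tuple[str, str]], lemma_of: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Lemmatize only the source side; the target side is left as-is."""
    return [(lemmatize_source(src, lemma_of), tgt) for src, tgt in pairs]


# Toy (assumed) Russian lemma table and a single illustrative sentence pair.
toy_lemmas = {"книги": "книга", "книгу": "книга", "читает": "читать"}
pairs = [("Он читает книгу", "Ол кітап оқиды")]
processed = preprocess_corpus(pairs, toy_lemmas)
```

After this pass, different inflected forms of a Russian word map to one source-side token, so the phrase table sees fewer distinct source forms.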