Experiments with Russian to Kazakh sentence alignment

Date

2016

Authors

Assylbekov, Zhenisbek
Myrzakhmetov, Bagdat
Makazhanov, Aibek

Publisher

The 4-th International Conference on Computer Processing of Turkic Languages “TurkLang 2016”

Abstract

Sentence alignment is the final step in building a parallel corpus, and arguably the step with the greatest impact on the quality of the resulting corpus and on the accuracy of machine translation systems trained on it. However, the quality of sentence alignment itself depends on a number of factors. In this paper we investigate the impact of several data processing techniques on the quality of sentence alignment. We develop and use a number of automatic evaluation metrics, and provide empirical evidence that applying all of the considered data processing techniques together yields bitexts with the lowest ratio of noise and the highest ratio of parallel sentences.

Keywords

sentence alignment, sentence splitting, lemmatization, parallel corpus, Kazakh language, Research Subject Categories::MATHEMATICS

Citation

Zhenisbek Assylbekov, Bagdat Myrzakhmetov and Aibek Makazhanov (2016) Experiments with Russian to Kazakh sentence alignment. The 4-th International Conference on Computer Processing of Turkic Languages “TurkLang 2016”.