DSpace Repository

BUILDING A CORPUS OF JOURNALISTIC KAZAKH LANGUAGE USING NEURAL NETWORKS FOR AUTOMATED PART-OF-SPEECH TAGGING

dc.contributor.author Mikhailov, Nikolay
dc.date.accessioned 2021-06-07T07:34:25Z
dc.date.available 2021-06-07T07:34:25Z
dc.date.issued 2021-06
dc.identifier.citation Mikhailov, N. (2021). Building a corpus of journalistic Kazakh language using neural networks for automated part-of-speech tagging (Unpublished master's thesis). Nazarbayev University, Nur-Sultan, Kazakhstan en_US
dc.identifier.uri http://nur.nu.edu.kz/handle/123456789/5451
dc.description.abstract This thesis examines the current state of corpus linguistics for the Kazakh language. The project surveys the existing Kazakh corpora and contributes to them a more flexible and more automated method of corpus building and annotation. The survey showed that, although there are efforts to digitize the Kazakh language, most of these projects are still under development. They range in scale from small student projects to projects led by Mozilla and by large research groups such as Apertium. This project therefore set out to build a corpus of journalistic Kazakh using neural networks for part-of-speech tagging. News websites served as the source, as they offer a decent vocabulary range while remaining easily accessible. A series of small Python programs extracted the text from the web pages to create the body of data to be annotated. In the final stage of the study, neural networks assigned each word its part of speech. Neural networks provide an automatable way of doing part-of-speech tagging that is faster than human annotation, with an accuracy that can be almost equal to that of humans. Although neural networks are a known approach to tagging and annotation, they have not yet been used in Kazakh corpus linguistics. The final model assigned the correct parts of speech to words with a reasonable degree of accuracy, which could be improved further with a larger sample of training data. The project can later be used to build a more extensive corpus with a high degree of automation, reducing the time required. en_US
dc.language.iso en en_US
dc.publisher Nazarbayev University School of Sciences and Humanities en_US
dc.rights Attribution-NonCommercial-ShareAlike 3.0 United States *
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/us/ *
dc.subject Type of access: Embargo en_US
dc.subject Kazakh language en_US
dc.subject linguistics en_US
dc.title BUILDING A CORPUS OF JOURNALISTIC KAZAKH LANGUAGE USING NEURAL NETWORKS FOR AUTOMATED PART-OF-SPEECH TAGGING en_US
dc.type Master's thesis en_US
workflow.import.source science


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 United States