DSpace Repository

Document and Word-level Language Identification for Noisy User Generated Text

dc.contributor.author Kozhirbayev, Zhanibek
dc.contributor.author Yessenbayev, Zhandos
dc.contributor.author Makazhanov, Aibek
dc.date.accessioned 2018-10-22T05:49:17Z
dc.date.available 2018-10-22T05:49:17Z
dc.date.issued 2018-10
dc.identifier.uri http://nur.nu.edu.kz/handle/123456789/3549
dc.description.abstract We present our work on language identification applied to comments left by readers of online news sites popular in Kazakhstan. Such comments are typically written in one of the two languages widely spoken in the area (Kazakh and Russian), and sometimes in a mixture of both. Code-switching (mixing languages) makes it desirable to identify language not only at the document level but also at the level of individual words. We approach both tasks in a single two-step framework that performs unsupervised normalization followed by Naïve Bayes text classification. In addition, we applied a deep learning model based on recurrent networks with LSTM cells to classify text. Our results suggest an improvement over the state of the art for the Kazakh language. en_US
dc.description.sponsorship This work has been supported by Nazarbayev University research grant 129-2017/022-2017 and the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under the research grant AP05134272. en_US
dc.language.iso en en_US
dc.publisher The IEEE 12th International Conference on Application of Information and Communication Technologies en_US
dc.rights Attribution-NonCommercial-NoDerivs 3.0 United States *
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/us/ *
dc.subject language identification, code-switching, user generated content, normalization en_US
dc.title Document and Word-level Language Identification for Noisy User Generated Text en_US
dc.type Conference Paper en_US
workflow.import.source science

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States.
