Document and Word-level Language Identification for Noisy User Generated Text

Authors: Kozhirbayev, Zhanibek; Yessenbayev, Zhandos; Makazhanov, Aibek
Source: The IEEE 12th International Conference Application of Information and Communication Technologies, 2018
Keywords: language identification, code-switching, user generated content, normalization
My University
Dates: 2018-10-22; 2018-10
Language: en
Type: Conference Paper
URI: http://nur.nu.edu.kz/handle/123456789/3549
License: Attribution-NonCommercial-NoDerivs 3.0 United States (http://creativecommons.org/licenses/by-nc-nd/3.0/us/)

Abstract: We present our work on language identification applied to comments left by readers of online news sites popular in Kazakhstan. Such comments are typically written in one of the two languages widely spoken in the area (Kazakh and Russian), and sometimes in a mixture of both. Code-switching (mixing languages) makes it desirable to identify the language not only at the document level but also at the level of individual words. We approach both tasks in a single two-step framework that performs unsupervised normalization followed by Naive Bayes text classification. In addition, we apply a deep learning model based on recurrent networks with LSTM cells to classify text. Our results suggest an improvement over the state of the art for the Kazakh language.
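
The classification step named in the abstract can be illustrated with a minimal sketch of a multinomial Naive Bayes language identifier applied at both the document and the word level. This is not the authors' implementation: the character-trigram features, the add-one smoothing, the class name `NaiveBayesLangId`, and the toy training sentences are all assumptions made for illustration, and the unsupervised normalization step is omitted.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return character n-grams of a lowercased, space-padded string."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLangId:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> training texts seen
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, text):
        """Log prior plus summed log likelihoods, with add-one smoothing."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        scores = {}
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        """Document-level decision: the label with the highest score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

    def predict_words(self, text):
        """Word-level decision: classify each whitespace token on its own."""
        return [(word, self.predict(word)) for word in text.split()]


# Toy training data (illustrative only, not taken from the paper's corpus).
train_texts = [
    "қазақ тілі өте әдемі",
    "бүгін ауа райы жақсы",
    "мен үйге қайтамын",
    "русский язык очень красивый",
    "сегодня хорошая погода",
    "я иду домой",
]
train_labels = ["kk", "kk", "kk", "ru", "ru", "ru"]

model = NaiveBayesLangId().fit(train_texts, train_labels)

print(model.predict("қазақ тілі"))
print(model.predict_words("бүгін хорошая погода"))
```

Classifying each token separately is the simplest way to expose code-switching inside a single comment. Very short words carry few trigrams, so a practical system would likely need more training data and some use of neighbouring context.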