Document and Word-level Language Identification for Noisy User Generated Text

dc.contributor.author: Kozhirbayev, Zhanibek
dc.contributor.author: Yessenbayev, Zhandos
dc.contributor.author: Makazhanov, Aibek
dc.date.accessioned: 2018-10-22T05:49:17Z
dc.date.available: 2018-10-22T05:49:17Z
dc.date.issued: 2018-10
dc.description.abstract: We present our work on language identification applied to comments left by readers of online news sites popular in Kazakhstan. Such comments are typically written in one of the two languages widely spoken in the area (Kazakh and Russian), and sometimes in a mixture of both. Code-switching (mixing languages) makes it desirable to identify the language not only at the document level but also at the level of individual words. We approach both tasks in a single two-step framework that performs unsupervised normalization followed by Naïve Bayes text classification. In addition, we applied a deep learning model based on recurrent networks with LSTM cells to classify the text. Our results suggest an improvement over the state of the art for the Kazakh language. [en_US]
dc.description.sponsorship: This work has been supported by Nazarbayev University research grant 129-2017/022-2017 and the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under the research grant AP05134272. [en_US]
dc.identifier.uri: http://nur.nu.edu.kz/handle/123456789/3549
dc.language.iso: en [en_US]
dc.publisher: The IEEE 12th International Conference Application of Information and Communication Technologies [en_US]
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 United States [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/us/ [*]
dc.subject: language identification, code-switching, user generated content, normalization [en_US]
dc.title: Document and Word-level Language Identification for Noisy User Generated Text [en_US]
dc.type: Conference Paper [en_US]
workflow.import.source: science

Files

Original bundle

Name: AICT2018_extracted.pdf
Size: 707.24 KB
Format: Adobe Portable Document Format
Description: Main article
