Initial Normalization of User Generated Content: Case Study in a Multilingual Setting

Myrzakhmetov, Bagdat; Yessenbayev, Zhandos; Makazhanov, Aibek

dc.contributor.author	Myrzakhmetov, Bagdat
dc.contributor.author	Yessenbayev, Zhandos
dc.contributor.author	Makazhanov, Aibek
dc.date.accessioned	2019-02-21T08:32:27Z
dc.date.available	2019-02-21T08:32:27Z
dc.date.issued	2018-10
dc.identifier.uri	http://nur.nu.edu.kz/handle/123456789/3749
dc.description.abstract	We address the problem of normalizing user generated content in a multilingual setting. Specifically, we target comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh, in Russian, or in a mixture of both. Moreover, such comments are noisy, i.e. difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravate the data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.	en_US
dc.language.iso	en	en_US
dc.publisher	The IEEE 12th International Conference on Application of Information and Communication Technologies	en_US
dc.rights	Attribution 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/us/	*
dc.subject	user generated content	en_US
dc.subject	normalization	en_US
dc.subject	code switching	en_US
dc.subject	transliteration	en_US
dc.title	Initial Normalization of User Generated Content: Case Study in a Multilingual Setting	en_US
dc.type	Article	en_US