DSpace Repository

Manual vs Automatic Bitext Extraction

dc.contributor.author Myrzakhmetov, Bagdat
dc.contributor.author Assylbekov, Zhenisbek
dc.contributor.author Makazhanov, Aibek
dc.date.accessioned 2019-09-02T08:34:50Z
dc.date.available 2019-09-02T08:34:50Z
dc.date.issued 2018-05-12
dc.identifier.citation Makazhanov, A., Myrzakhmetov, B., & Assylbekov, Z. (2018). Manual vs Automatic Bitext Extraction. In 11th International Conference on Language Resources and Evaluation, Miyazaki, Japan. en_US
dc.identifier.uri http://nur.nu.edu.kz/handle/123456789/4201
dc.description.abstract We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out with the markup. When it comes to sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as fiddling with off-the-shelf solutions, pays a noticeable dividend. Overall we observe that, depending on the source, automatic bitext extraction methods may lack severely in coverage (retrieve fewer sentence pairs) and on average are less precise (retrieve fewer parallel sentence pairs). We conclude that if one aims at extracting high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution. en_US
dc.language.iso en en_US
dc.publisher Nazarbayev University, School of Sciences and Humanities en_US
dc.rights Attribution-NonCommercial-ShareAlike 3.0 United States *
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/us/ *
dc.subject bitext extraction en_US
dc.subject crawling en_US
dc.subject sentence alignment en_US
dc.title Manual vs Automatic Bitext Extraction en_US
dc.type Conference Paper en_US
workflow.import.source science



Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 United States