Manual vs Automatic Bitext Extraction

dc.contributor.author: Myrzakhmetov, Bagdat
dc.contributor.author: Assylbekov, Zhenisbek
dc.contributor.author: Makazhanov, Aibek
dc.date.accessioned: 2019-09-02T08:34:50Z
dc.date.available: 2019-09-02T08:34:50Z
dc.date.issued: 2018-05-12
dc.description.abstract: We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out with the markup. When it comes to sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as fiddling with off-the-shelf solutions, pays a noticeable dividend. Overall we observe that, depending on the source, automatic bitext extraction methods may lack severely in coverage (retrieve fewer sentence pairs) and on average are less precise (retrieve fewer parallel sentence pairs). We conclude that if one aims at extracting high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution. [en_US]
dc.identifier.citation: Makazhanov, A., Myrzakhmetov, B., & Assylbekov, Z. (2018). Manual vs Automatic Bitext Extraction. In 11th International Conference on Language Resources and Evaluation, Miyazaki, Japan. [en_US]
dc.identifier.uri: http://nur.nu.edu.kz/handle/123456789/4201
dc.language.iso: en [en_US]
dc.publisher: Nazarbayev University School of Sciences and Humanities [en_US]
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 United States [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/ [*]
dc.subject: bitext extraction [en_US]
dc.subject: crawling [en_US]
dc.subject: sentence alignment [en_US]
dc.title: Manual vs Automatic Bitext Extraction [en_US]
dc.type: Conference Paper [en_US]
workflow.import.source: science

Files

Original bundle
Name: ++Myrzakhmetov Manual vs Bitext.pdf
Size: 367.06 KB
Format: Adobe Portable Document Format
Description: Paper
License bundle
Name: license.txt
Size: 6 KB
Format: Item-specific license agreed upon to submission
Description:
