An Assessment of Universal Dependency Annotation Guidelines for Turkic Languages

dc.contributor.author: Tyers, Francis
dc.contributor.author: Washington, Jonathan
dc.contributor.author: Çöltekin, Çağrı
dc.contributor.author: Makazhanov, Aibek
dc.contributor.editor: Suleymanov, Dzhavdet
dc.contributor.editor: Gatiatullin, Ayrat
dc.date.accessioned: 2018-05-02T09:47:13Z
dc.date.available: 2018-05-02T09:47:13Z
dc.date.issued: 2017-10-21
dc.description.abstract: Annotated corpora of three Turkic languages – Turkish, Kazakh, and Uyghur – were released as part of version 2 of the Free/Open-Source Universal Dependencies (UD) syntactic and morphological annotation guidelines. The objective of these guidelines is to provide consistent dependency annotation to facilitate cross-linguistic comparison. This paper presents the current state of each of the three UD-annotated Turkic corpora, along with an evaluation of the performance of parsers trained on these corpora.
Overall, the UD annotation guidelines for Turkish, Kazakh, and Uyghur are fairly compatible – a testament to the careful design of the guidelines. However, the specific annotation guidelines for each of these languages were developed mostly independently; because of this, differences between the three standards exist. Moving forward with Turkic annotation standards in UD, attempts will be made to reconcile the differences. These differences are overviewed in this paper. Furthermore, a number of issues in annotation have arisen and have yet to be resolved. Some of these issues require further investigation of the phenomena, and some require consultation within the UD community to determine whether solutions may be determined based on similar phenomena in other languages. A number of these open issues are discussed, including tokenisation (how to deal with words that include an orthographic space, or multiple words that do not include an orthographic space), the difference between core and oblique arguments of verbs, complex predicates (including structures where there is a combination of a non-finite form which governs argument structure and contributes to TAM and a finite-form which contributes to TAM and takes person agreement), multiple derivation (multiple causative or causative–passive combinations), and use of copulas instead of auxiliaries in what appear to be auxiliary constructions. [en_US]
dc.identifier.isbn: 978-5-9690-0406-1
dc.identifier.uri: http://nur.nu.edu.kz/handle/123456789/3168
dc.language.iso: en [en_US]
dc.publisher: Tatarstan Academy of Sciences [en_US]
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 United States [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/ [*]
dc.subject: Turkish; Kazakh; Uyghur; treebank; dependency grammar; Universal Dependencies [en_US]
dc.title: An Assessment of Universal Dependency Annotation Guidelines for Turkic Languages [en_US]
dc.type: Conference Paper [en_US]
workflow.import.source: science

Files

Original bundle
Name: tl17_ud_proceedings.pdf
Size: 829.74 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 6 KB
Format: Item-specific license agreed upon to submission
