Character-based Deep Learning Models for Token and Sentence Segmentation

dc.contributor.author: Toleu, Alymzhan
dc.contributor.author: Tolegen, Gulmira
dc.contributor.author: Makazhanov, Aibek
dc.contributor.editor: Suleymanov, Dzhavdet
dc.contributor.editor: Gatiatullin, Ayrat
dc.date.accessioned: 2018-05-02T09:44:27Z
dc.date.available: 2018-05-02T09:44:27Z
dc.date.issued: 2017-10-21
dc.description.abstract: In this work we address the problems of sentence segmentation and tokenization. Informally, the task of sentence segmentation involves splitting a given text into units that satisfy a certain definition (or a number of definitions) of a sentence. Similarly, tokenization aims to split a text into chunks that constitute the basic units of operation for a given task, e.g., words, digits, punctuation marks, and other symbols for part-of-speech tagging. As follows from this definition, tokenization is a prerequisite for virtually every natural language processing (NLP) task. Many so-called downstream NLP applications of higher sophistication, e.g., machine translation, additionally require sentence segmentation. Both problems that we address are thus among the most basic steps in NLP and, as such, are widely regarded as solved. Indeed, there is a large body of work devoted to these problems and a number of popular, highly accurate off-the-shelf solutions for them. Nevertheless, the problems of sentence segmentation and tokenization persist, and in practice one often faces difficulties when confronted with raw text that needs to be tokenized and/or split into sentences. This happens because existing approaches, if unsupervised, rely heavily on hand-crafted rules and lexicons, or, if supervised, rely on the extraction of hand-engineered features. Such systems are not easy to maintain or adapt to new domains and languages, because doing so may require revising the rules and feature definitions. To address these challenges, we develop character-based deep learning models that require neither rule nor feature engineering. The only resource required is a training set in which each character is labeled with an IOB (Inside-Outside-Beginning) tag. Such training sets are easily obtainable from existing tokenized and sentence-segmented corpora, or, in the absence of those, have to be created (but the same is true for rules, lexicons, and hand-crafted features). The IOB-like annotation allows us to solve the tokenization and sentence segmentation problems simultaneously by casting them as a single sequence-labeling task, where each character is tagged with one of four tags: beginning of a sentence (S), beginning of a token (T), inside of a token (I), and outside of a token (O). To this end, we design three models based on artificial neural networks: (i) a fully connected feed-forward network; (ii) a long short-term memory (LSTM) network; (iii) a bi-directional LSTM. The proposed models utilize character embeddings, i.e., they represent characters as vectors in a continuous multidimensional space. We evaluate our approach on three typologically distant languages, namely English, Italian, and Kazakh. As evaluation metrics, we use standard precision, recall, and F-measure scores, as well as a combined error rate for sentence and token boundary detection. We use two state-of-the-art supervised systems as baselines and show that our models consistently outperform both of them in terms of error rate.
dc.identifier.isbn: 978-5-9690-0406-1
dc.identifier.uri: http://nur.nu.edu.kz/handle/123456789/3165
dc.language.iso: en
dc.publisher: Tatarstan Academy of Sciences
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/
dc.subject: Token and Sentence Segmentation; Neural Networks; Deep Learning
dc.title: Character-based Deep Learning Models for Token and Sentence Segmentation
dc.type: Conference Paper
workflow.import.source: science
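
As an illustration of the labeling scheme described in the abstract, the sketch below (not the authors' code; the helper name and the use of single spaces between tokens and sentences are assumptions made only to keep the example self-contained) shows how character-level S/T/I/O labels can be derived from a tokenized, sentence-segmented corpus. Sequences produced this way are the kind of training material the abstract describes for the character-embedding feed-forward, LSTM, and bi-directional LSTM taggers.

# Illustrative sketch (not from the paper): derive character-level
# S/T/I/O training labels from a tokenized, sentence-segmented corpus,
# following the four-tag scheme described in the abstract:
#   S = beginning of a sentence, T = beginning of a token,
#   I = inside of a token,       O = outside of a token (e.g. whitespace).
# Joining tokens and sentences with single spaces is an assumption made
# here only so the example is self-contained and runnable.

def label_characters(sentences):
    """sentences: list of sentences, each a list of token strings.
    Returns parallel lists of characters and S/T/I/O labels."""
    chars, labels = [], []
    for s_idx, sentence in enumerate(sentences):
        for t_idx, token in enumerate(sentence):
            for c_idx, ch in enumerate(token):
                if t_idx == 0 and c_idx == 0:
                    tag = "S"      # first character of the sentence
                elif c_idx == 0:
                    tag = "T"      # first character of a token
                else:
                    tag = "I"      # inside a token
                chars.append(ch)
                labels.append(tag)
            if t_idx < len(sentence) - 1:
                chars.append(" ")  # separator between tokens
                labels.append("O")
        if s_idx < len(sentences) - 1:
            chars.append(" ")      # separator between sentences
            labels.append("O")
    return chars, labels

if __name__ == "__main__":
    corpus = [["Tokenization", "is", "basic", "."],
              ["So", "is", "sentence", "segmentation", "."]]
    for ch, tag in zip(*label_characters(corpus)):
        print(repr(ch), tag)

At prediction time the mapping is reversed: a new sentence starts at every character tagged S, a new token at every character tagged S or T, and characters tagged O are treated as token-external material.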

Files

Original bundle
Name: tl17_tokenization_proceedings.pdf
Size: 493.29 KB
Format: Adobe Portable Document Format
Description: main article
License bundle
Name: license.txt
Size: 6 KB
Format: Item-specific license agreed upon to submission