Character-based Deep Learning Models for Token and Sentence Segmentation

Toleu, Alymzhan; Tolegen, Gulmira; Makazhanov, Aibek

dc.contributor.author	Toleu, Alymzhan
dc.contributor.author	Tolegen, Gulmira
dc.contributor.author	Makazhanov, Aibek
dc.contributor.editor	Suleymanov, Dzhavdet
dc.contributor.editor	Gatiatullin, Ayrat
dc.date.accessioned	2018-05-02T09:44:27Z
dc.date.available	2018-05-02T09:44:27Z
dc.date.issued	2017-10-21
dc.identifier.isbn	978-5-9690-0406-1
dc.identifier.uri	http://nur.nu.edu.kz/handle/123456789/3165
dc.description.abstract	In this work we address the problems of sentence segmentation and tokenization. Informally, the task of sentence segmentation involves splitting a given text into units that satisfy a certain definition (or a number of definitions) of a sentence. Similarly, tokenization has as its goal splitting a text into chunks that for a certain task constitute basic units of operation, e.g. words, digits, punctuation marks and other symbols for part-of-speech tagging. As seen from the definition, tokenization is an absolute prerequisite for virtually every natural language processing (NLP) task. Many so-called downstream NLP applications with a higher level of sophistication, e.g. machine translation, additionally require sentence segmentation. Thus, both problems that we address are very basic steps in NLP and, as such, are widely regarded as solved. Indeed, there is a large body of work devoted to these problems, and there are a number of popular, highly accurate off-the-shelf solutions for them. Nevertheless, the problems of sentence segmentation and tokenization persist, and in practice one often faces difficulties when confronted with raw text that needs to be tokenized and/or split into sentences. This happens because existing approaches, if they are unsupervised, rely heavily on hand-crafted rules and lexicons, or, if they are supervised, rely on the extraction of hand-engineered features. Such systems are not easy to maintain or adapt to new domains and languages, because doing so may require revising the rules and feature definitions. To address these challenges, we develop character-based deep learning models that require neither rule nor feature engineering. The only resource required is a training set in which each character is labeled with an IOB (Inside Outside Beginning) tag. 
Such training sets are easily attainable from existing tokenized and sentence-segmented corpora, or, in the absence of those, have to be created (but the same is true for rules, lexicons, and hand-crafted features). The IOB-like annotation allows us to solve both the tokenization and sentence segmentation problems simultaneously by casting them as a single sequence-labeling task, where each character has to be tagged with one of four tags: beginning of a sentence (S), beginning of a token (T), inside of a token (I) and outside of a token (O). To this end we design three models based on artificial neural networks: (i) a fully connected feed-forward network; (ii) a long short-term memory (LSTM) network; (iii) a bi-directional version of LSTM. The proposed models utilize character embeddings, i.e. represent characters as vectors in a multidimensional continuous space. We evaluate our approach on three typologically distant languages, namely English, Italian, and Kazakh. As evaluation metrics we use standard precision, recall, and F-measure scores, as well as a combined error rate for sentence and token boundary detection. We use two state-of-the-art supervised systems as baselines, and show that our models consistently outperform both of them in terms of error rate.	en_US
dc.language.iso	en	en_US
dc.publisher	Tatarstan Academy of Sciences	en_US
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/us/	*
dc.subject	Token and Sentence Segmentation; Neural Networks; Deep Learning	en_US
dc.title	Character-based Deep Learning Models for Token and Sentence Segmentation	en_US
dc.type	Conference Paper	en_US
workflow.import.source	science
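
The character-level S/T/I/O labeling described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names char_tags and decode are hypothetical, and the sketch assumes that the first character of a sentence receives S (rather than T), that whitespace between tokens receives O, and that the tokens of the segmented corpus occur in the raw text in order.

```python
from typing import List


def char_tags(text: str, sentences: List[List[str]]) -> List[str]:
    """Label every character of `text` with S, T, I, or O.

    Assumption (not stated in the source): the first character of a
    sentence is tagged S, and the first character of any other token T.
    """
    tags = ["O"] * len(text)  # anything not covered by a token stays O
    pos = 0
    for sentence in sentences:
        for i, token in enumerate(sentence):
            start = text.find(token, pos)  # tokens must appear in order
            if start < 0:
                raise ValueError(f"token {token!r} not found after offset {pos}")
            tags[start] = "S" if i == 0 else "T"
            for j in range(start + 1, start + len(token)):
                tags[j] = "I"
            pos = start + len(token)
    return tags


def decode(text: str, tags: List[str]) -> List[List[str]]:
    """Rebuild sentences and tokens from a per-character tag sequence."""
    sentences: List[List[str]] = []
    for ch, tag in zip(text, tags):
        if tag == "O":
            continue  # skip characters outside any token
        if tag == "S" or not sentences:
            sentences.append([ch])    # open a new sentence and token
        elif tag == "T":
            sentences[-1].append(ch)  # open a new token in this sentence
        else:
            sentences[-1][-1] += ch   # tag I: extend the current token
    return sentences


text = "Hi there. Bye!"
gold = [["Hi", "there", "."], ["Bye", "!"]]
tags = char_tags(text, gold)
print("".join(tags))       # SIOTIIIITOSIIT
print(decode(text, tags))  # [['Hi', 'there', '.'], ['Bye', '!']]
```

Under this encoding a tagger that predicts one label per character yields token and sentence boundaries in a single pass, which is what lets the two problems be treated as one sequence-labeling task.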