DSpace Repository



dc.contributor.author Mussylmanbay, Meiirgali
dc.date.accessioned 2022-09-16T05:26:09Z
dc.date.available 2022-09-16T05:26:09Z
dc.date.issued 2022-07
dc.identifier.citation Mussylmanbay, M. (2022). Addresses Standardization and Geocoding using Natural Language Processing (Unpublished master's thesis). Nazarbayev University, Nur-Sultan, Kazakhstan en_US
dc.identifier.uri http://nur.nu.edu.kz/handle/123456789/6705
dc.description.abstract Geocoding, the process of converting textual addresses into coordinate pairs, is a preliminary step in spatial analysis. However, converting addresses into latitude and longitude is not a trivial task: addresses are written as arbitrary text, are often incomplete, and do not follow a fixed structure. The thesis therefore discusses the theoretical fundamentals of textual data normalization and standardization and presents practical approaches for bringing addresses written in various ways to a single standard. To bind textual addresses to their geocodes, we conducted experiments on data collected from 5 publicly available sources, using Elasticsearch with its built-in BM25 similarity algorithm as well as the state-of-the-art BERT model. We adopted the OpenStreetMap address structure as the gold standard and cosine similarity as the text similarity measure. The models' outputs were verified on 100 randomly chosen records, and the results were visualized on a map to illustrate applicable use cases of geocoding. Further, the raw address data and the address standardization results serve as training and test data for predicting the closest address and the corresponding geocodes for arbitrary address representations. For this prediction of 'correct' addresses, we used models based on the Transformer architecture, namely T5 and BART, and BLEU was used as the reference metric for comparing the models' accuracy. Overall, the thesis provides theoretical background and serves as a practical reference on how clean addresses can be recovered from non-standard addresses using state-of-the-art models. en_US
dc.language.iso en en_US
dc.publisher Nazarbayev University School of Engineering and Digital Sciences en_US
dc.rights Attribution-NonCommercial-ShareAlike 3.0 United States *
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/us/ *
dc.subject Type of access: Open Access en_US
dc.subject Research Subject Categories::TECHNOLOGY en_US
dc.subject Natural Language Processing en_US
dc.subject BLEU en_US
dc.subject BART en_US
dc.subject BERT en_US
dc.subject Transformer based Generative Model en_US
dc.type Master's thesis en_US
workflow.import.source science


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 United States
