Mussylmanbay, Meiirgali
2022-09-16
2022-09-16
2022-07
Mussylmanbay, M. (2022). Addresses Standardization and Geocoding using Natural Language Processing (Unpublished master's thesis). Nazarbayev University, Nur-Sultan, Kazakhstan
http://nur.nu.edu.kz/handle/123456789/6705
Geocoding, the process of converting textual addresses into a pair of coordinates, is a preliminary step in spatial analysis. However, converting addresses into latitude and longitude is not a trivial task: addresses are written as arbitrary text, are often incomplete, and do not follow a fixed structure. The thesis therefore discusses the theoretical fundamentals of text normalization and standardization and presents practical approaches for bringing addresses written in various ways to a single standard. To bind textual addresses to their geocodes, we conducted experiments on data collected from 5 publicly available sources, using Elasticsearch with its built-in BM25 similarity algorithm as well as the state-of-the-art BERT model. We adopted the OpenStreetMap address structure as the gold standard and cosine similarity as the text similarity measure. The practical outcomes of the models were verified on 100 randomly chosen records, and the results were visualized on a map to illustrate applicable cases of geocoding. Further, the raw address data and the address standardization results serve as training and test data for predicting the closest address and the corresponding geocodes for arbitrary address representations. For this task we used models based on the Transformer architecture, namely T5 and BART, to predict 'correct' addresses. In addition, BLEU was used as a reference metric to compare the models' accuracy.
Overall, the thesis offers a rich theoretical background and serves as a practical reference for how clean addresses can be derived from non-standard ones using state-of-the-art models.
en
Attribution-NonCommercial-ShareAlike 3.0 United States
Type of access: Open Access
Research Subject Categories::TECHNOLOGY
Natural Language Processing
BLEU
BART
BERT
Transformer based Generative Model
ADDRESSES STANDARDIZATION AND GEOCODING USING NATURAL LANGUAGE PROCESSING
Master's thesis