ADDRESSES STANDARDIZATION AND GEOCODING USING NATURAL LANGUAGE PROCESSING

dc.contributor.author: Mussylmanbay, Meiirgali
dc.date.accessioned: 2022-09-16T05:26:09Z
dc.date.available: 2022-09-16T05:26:09Z
dc.date.issued: 2022-07
dc.description.abstract: Geocoding, the process of converting textual addresses into a pair of coordinates, is a preliminary step in spatial analysis. However, converting addresses into latitude and longitude is not trivial: addresses arrive as arbitrary text, are often incomplete, and do not follow a fixed structure. The thesis therefore discusses the theoretical fundamentals of textual data normalization and standardization and presents practical approaches for bringing addresses written in various ways to a single standard. To bind textual addresses to their geocodes, we conducted experiments on data collected from 5 publicly available sources, using Elasticsearch with its built-in BM25 similarity algorithm as well as a state-of-the-art model, BERT. We adopted the OpenStreetMap address structure as the gold standard and cosine similarity as the text similarity measure. The models' outputs were verified on 100 randomly chosen records, and the results were visualized on a map to illustrate applicable uses of geocoding. Further, the raw address data and the address standardization results serve as training and test data for predicting the closest address and the corresponding geocodes for arbitrary address representations. We used Transformer-based models, namely T5 and BART, to predict 'correct' addresses, and BLEU as the reference metric for comparing the models' accuracy. Overall, the thesis provides theoretical background and serves as a practical reference for recovering clean addresses from non-standard ones using state-of-the-art models. [en_US]
dc.identifier.citation: Mussylmanbay, M. (2022). Addresses Standardization and Geocoding using Natural Language Processing (Unpublished master's thesis). Nazarbayev University, Nur-Sultan, Kazakhstan [en_US]
dc.identifier.uri: http://nur.nu.edu.kz/handle/123456789/6705
dc.language.iso: en [en_US]
dc.publisher: Nazarbayev University School of Engineering and Digital Sciences [en_US]
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 United States [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/ [*]
dc.subject: Type of access: Open Access [en_US]
dc.subject: Research Subject Categories::TECHNOLOGY [en_US]
dc.subject: Natural Language Processing [en_US]
dc.subject: BLEU [en_US]
dc.subject: BART [en_US]
dc.subject: BERT [en_US]
dc.subject: Transformer based Generative Model [en_US]
dc.title: ADDRESSES STANDARDIZATION AND GEOCODING USING NATURAL LANGUAGE PROCESSING [en_US]
dc.type: Master's thesis [en_US]
workflow.import.source: science

Files

Original bundle
Name: Thesis - Meiirgali Mussylmanbay.pdf
Size: 2.77 MB
Format: Adobe Portable Document Format
Description: Thesis
Name: Presentation - Meiirgali Mussylmanbay.pptx
Size: 5.39 MB
Format: Microsoft Powerpoint XML
Description: Presentation
License bundle
Name: license.txt
Size: 6.28 KB
Format: Item-specific license agreed upon to submission
Description: