USING NATURAL LANGUAGE PROCESSING TO THE HUMANITIES: EXPLORING CLIMATE CHANGE IN THE SCIENTIFIC LITERATURE

dc.contributor.author: Berdimbetov, Dossym
dc.date.accessioned: 2022-09-16T05:14:05Z
dc.date.available: 2022-09-16T05:14:05Z
dc.date.issued: 2022-07
dc.description.abstract: Today’s scientific community struggles with searching and processing data and with extracting hidden relations and critical facts from articles. This study considers how to improve the situation. We explored methods for obtaining metadata from various scientific article aggregators before settling on "www.core.ac.uk". After duplicate removal, language filtering, and preprocessing, the data set shrank from 111,552 records to 49,310. Stemming, standardization, and lemmatization were used for data preprocessing. Our goal is to compare the clustering results obtained with Word2Vec, FastText, Doc2Vec, and Top2Vec embeddings. We used TF-IDF with K-Means clustering as the baseline model. A critical point is that real-world data is diverse and varies in density, so both the local and the global structure of the data must be preserved. We therefore used UMAP for dimensionality reduction of dense and arbitrary data and HDBSCAN, which detects clusters based on density. Based on the test results, we interviewed five assessors. The results are promising, but we intend to continue research on this subject in the future, including topic models such as LDA and BERTopic. [en_US]
dc.identifier.citation: Berdimbetov, D. (2022). Using Natural Language Processing to the Humanities: Exploring Climate Change in the Scientific Literature (Unpublished master's thesis). Nazarbayev University, Nur-Sultan, Kazakhstan [en_US]
dc.identifier.uri: http://nur.nu.edu.kz/handle/123456789/6702
dc.language.iso: en [en_US]
dc.publisher: Nazarbayev University School of Engineering and Digital Sciences [en_US]
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 United States [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/ [*]
dc.subject: type of access: gated access [en_US]
dc.subject: Research Subject Categories::TECHNOLOGY [en_US]
dc.subject: Natural Language Processing [en_US]
dc.subject: Climate Change [en_US]
dc.subject: Scientific Literature [en_US]
dc.title: USING NATURAL LANGUAGE PROCESSING TO THE HUMANITIES: EXPLORING CLIMATE CHANGE IN THE SCIENTIFIC LITERATURE [en_US]
dc.type: Master's thesis [en_US]
workflow.import.source: science
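
Illustrative sketch (not taken from the thesis): the abstract names TF-IDF with K-Means clustering as the baseline model. The Python below shows roughly what such a baseline does, using only the standard library. It is not the author's implementation, which is not reproduced in this record. The function names (tfidf_vectors, farthest_first_seeds, kmeans), the whitespace tokenisation, the smoothed IDF formula, and the deterministic farthest-first seeding are all assumptions made for this example; the toy documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    vectors = []
    for toks in tokenised:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            # Smoothed IDF (an assumption of this sketch).
            idf = math.log((1 + n_docs) / (1 + df[tok])) + 1.0
            vec[index[tok]] = count * idf
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_seeds(vectors, k):
    """Deterministic seeding: start at the first vector, then repeatedly
    pick the vector farthest from all seeds chosen so far."""
    seeds = [list(vectors[0])]
    while len(seeds) < k:
        nxt = max(vectors, key=lambda v: min(sq_dist(v, s) for s in seeds))
        seeds.append(list(nxt))
    return seeds


def kmeans(vectors, k, n_iter=50):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = farthest_first_seeds(vectors, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    toy_docs = [
        "sea level rise coastal flooding",
        "coastal flooding sea level",
        "sea level rise flooding",
        "solar wind energy policy",
        "wind energy solar policy",
        "energy policy solar wind",
    ]
    print(kmeans(tfidf_vectors(toy_docs), k=2))
```

On a corpus of the size reported in the abstract, a library implementation with sparse matrices would be the practical choice; the sketch only makes the two steps (term weighting, then centroid-based clustering) concrete.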

Files

Original bundle

Name: Presentation - Dossym Berdimbetov.pptx
Size: 5.16 MB
Format: Microsoft Powerpoint XML
Description: Presentation

Name: Thesis - Dossym Berdimbetov.pdf
Size: 5.12 MB
Format: Adobe Portable Document Format
Description: Thesis