Abstract:
Researchers today struggle to search and process the growing volume of scientific
articles and to extract hidden relations and critical facts from them. This work
looks at how to improve the situation. This study explored methods for obtaining metadata
from various scientific article aggregators before settling on "www.core.ac.uk".
After duplicate removal, language filtering, and preprocessing, the data set was reduced
from 111,552 to 49,310 records. Preprocessing consisted of stemming, standardization,
and lemmatization. Our goal is to compare the clustering results obtained with
Word2Vec, FastText, Doc2Vec, and Top2Vec embeddings. TF-IDF combined with K-Means
clustering served as the baseline model.
A key consideration is that real-world data is diverse and varies in density, so
both the local and the global structure of the data must be preserved. We therefore used the
UMAP dimensionality reduction approach for dense and arbitrary data and the HDBSCAN
algorithm, which detects clusters based on density. We then asked five assessors
to evaluate the test results. The results are promising, and we intend to continue
research on this subject in the future, including topic models such as LDA and
BERTopic.
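
As an illustration of the baseline mentioned above (TF-IDF followed by K-Means), here is a minimal sketch using only the Python standard library. It is not the implementation used in the study: the function names (`tfidf_vectors`, `kmeans`), the smoothed IDF formula, the deterministic farthest-point initialization, and the four toy documents are all assumptions made for this example. A real pipeline would more likely rely on an established library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalized TF-IDF vectors (hypothetical helper, whitespace tokenizer)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed IDF (an assumed variant; other formulas are equally valid).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain K-Means with deterministic farthest-point initialization (assumed)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Pick the point farthest from its nearest existing centroid.
        far = max(vectors, key=lambda v: min(dist2(v, c) for c in centroids))
        centroids.append(far)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda j: dist2(v, centroids[j])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy documents (invented for illustration, not taken from the data set).
docs = [
    "neural network image classification",
    "deep neural network training",
    "protein gene expression analysis",
    "gene sequencing protein folding",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

On this toy input, the two machine-learning documents end up in one cluster and the two biology documents in the other. K-Means requires the number of clusters in advance and favors compact, similarly sized groups, which is one reason density-based methods are of interest for data with varying densities.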