Abstract:
Today there is an urgent need for massively scalable and efficient tools for Big Data processing. Even the smallest companies nowadays inevitably require ever more resources for data processing routines that can enhance decision making and reliably predict and simulate different scenarios. In this paper we present our combined work on several massively scalable approaches to clustering and topic modeling of a dataset collected by crawling Kazakhstan news websites. In particular, we propose Apache Spark parallel solutions to the news clustering and topic modeling problems and, additionally, describe the results of implementing document clustering on our partitioned global address space (PGAS) MapReduce system. We describe our experience in solving these problems and investigate the efficiency and scalability of the proposed solutions.