Pairwise Overlap and Misclassification in Cluster Analysis

Akynkozhayev, Birzhan

dc.contributor.author	Akynkozhayev, Birzhan
dc.date.accessioned	2016-06-16T03:18:44Z
dc.date.available	2016-06-16T03:18:44Z
dc.date.issued	2015
dc.identifier.citation	Akynkozhayev, Birzhan. 2015. Pairwise Overlap and Misclassification in Cluster Analysis. School of Science and Technology, Mathematics Department. http://nur.nu.edu.kz/handle/123456789/1635	ru_RU
dc.identifier.uri	http://nur.nu.edu.kz/handle/123456789/1635
dc.description.abstract	Separating data into distinct groups is one of the most important tools for learning and a means of obtaining valuable information from data. Cluster analysis studies ways of distributing objects into groups with similar characteristics. Real-world applications include separating a population by age, grouping customers by loyalty, and classifying living organisms into kingdoms. In particular, cluster analysis is an important objective of data mining, which studies ways of extracting key information from data and converting it into a more understandable form. There is no single best algorithm for producing data partitions in cluster analysis, but many perform well in various circumstances (Jain, 2008). Many popular clustering algorithms are based on iterative partitioning, in which single items are moved step by step from one cluster to another to optimize some criterion. One such algorithm, discussed in this paper, is the K-means algorithm, which partitions data points by minimizing the within-cluster sum of squared distances (MacQueen, 1967). Another large class of algorithms is based on finite mixture models. For example, the stochastic emEM clustering method, also covered in this paper, is based on maximum likelihood estimation of the parameters of a statistical model (Melnykov & Maitra). Misclassification is not rare in cluster analysis. For instance, several points are misclassified in Figure 1, which compares the true partition (a) with the solution found by the K-means algorithm (b). Various factors lead to misclassification in clustering algorithms. The main goal of this paper is to analyze the effect of pairwise overlap, the number of dimensions of the data, and the number of clusters on misclassification. 
The simplest case in which misclassification can occur is that of two clusters. The overlap is exact in this case, so we used one of the simplest algorithms, K-means. For larger numbers of clusters, where the overlap is estimated, we considered the more complex emEM algorithm.	ru_RU
dc.language.iso	en	ru_RU
dc.publisher	Nazarbayev University School of Science and Technology
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/us/	*
dc.subject	Research Subject Categories	ru_RU
dc.subject	Cluster Analysis	ru_RU
dc.title	Pairwise Overlap and Misclassification in Cluster Analysis	ru_RU
dc.type	Capstone Project	ru_RU
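
The two-cluster case described in the abstract can be illustrated with a minimal sketch. It is not the author's code. It assumes two equal-weight spherical Gaussian clusters with a common standard deviation, and it assumes pairwise overlap is defined as the sum of the two misclassification probabilities, a definition the abstract does not state. The function names (`kmeans`, `misclassification_rate`, `exact_overlap`), the separation `delta`, and the sample size are hypothetical choices made for illustration.

```python
import math
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style K-means: reduces the within-cluster sum of squared distances."""
    rng = np.random.default_rng(seed)
    # Initialise centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each center as the mean of its points (keep old center if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def misclassification_rate(true_labels, pred_labels):
    """Two-cluster misclassification rate, invariant to swapping the label names."""
    err = float(np.mean(true_labels != pred_labels))
    return min(err, 1.0 - err)

def exact_overlap(delta, sigma=1.0):
    """Assumed definition: sum of the two misclassification probabilities for two
    equal-weight spherical Gaussians whose means are delta apart: 2 * Phi(-delta / (2 * sigma))."""
    phi = 0.5 * math.erfc(delta / (2.0 * sigma) / math.sqrt(2.0))
    return 2.0 * phi

# Simulated data (not from the paper): two 2-D clusters whose means are delta apart.
delta, n = 3.0, 500
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n, 2)),
    rng.normal(0.0, 1.0, size=(n, 2)) + np.array([delta, 0.0]),
])
true_labels = np.repeat([0, 1], n)

pred_labels, _ = kmeans(X, k=2)
rate = misclassification_rate(true_labels, pred_labels)
print(f"overlap = {exact_overlap(delta):.4f}, K-means misclassification = {rate:.4f}")
```

Under these assumptions, reducing `delta` increases the overlap, and the misclassification rate of K-means would be expected to rise with it, which is the kind of relationship the paper sets out to study.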