Pairwise Overlap and Misclassification in Cluster Analysis

dc.contributor.author: Akynkozhayev, Birzhan
dc.date.accessioned: 2016-06-16T03:18:44Z
dc.date.available: 2016-06-16T03:18:44Z
dc.date.issued: 2015
dc.description.abstract: Separating data into distinct groups is one of the most important tools of learning and a means of obtaining valuable information from data. Cluster analysis studies ways of distributing objects into groups with similar characteristics. Real-world examples include separating a population by age, grouping customers by loyalty, and classifying living organisms into kingdoms. In particular, cluster analysis is an important objective of data mining, which focuses on extracting key information from data and converting it into a more understandable form. There is no single best algorithm for producing data partitions in cluster analysis, but many perform well in various circumstances (Jain, 2008). Many popular clustering algorithms are based on an iterative partitioning method, in which single items are moved step by step from one cluster to another to optimize some criterion. One such algorithm, discussed in this paper, is the K-means algorithm, in which data points are partitioned by minimizing the sum of squared distances within clusters (MacQueen, 1967). Another large class of algorithms is based on finite mixture models. For example, the stochastic emEM clustering method, also covered in this paper, is based on maximum likelihood estimation of statistical model parameters (Melnykov & Maitra). Misclassification of data is not rare in cluster analysis. For instance, in Figure 1 several points are misclassified when the true partition (a) is compared with the solution found by the K-means algorithm (b). Various factors lead to misclassification in clustering algorithms. The main goal of this paper is to analyze the effect of pairwise overlap, the number of dimensions of the data, and the number of clusters on misclassification.
The simplest case in which misclassification can occur is when there are two clusters. The overlap is exact in this case; thus, we used one of the simplest algorithms, K-means. For higher numbers of clusters, where the overlap is estimated, we considered the more complex emEM algorithm.
dc.identifier.citation: Akynkozhayev, Birzhan. 2015. Pairwise Overlap and Misclassification in Cluster Analysis. School of Science and Technology. Mathematics Department. http://nur.nu.edu.kz/handle/123456789/1635
dc.identifier.uri: http://nur.nu.edu.kz/handle/123456789/1635
dc.language.iso: en
dc.publisher: Nazarbayev University School of Science and Technology
dc.rights: Attribution-NonCommercial-ShareAlike 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/
dc.subject: Research Subject Categories
dc.subject: Cluster Analysis
dc.title: Pairwise Overlap and Misclassification in Cluster Analysis
dc.type: Capstone Project
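
The two-cluster case described in the abstract can be illustrated with a minimal sketch. It is not the author's code: the helper names (`two_gaussians`, `kmeans`, `misclassification_rate`, `theoretical_overlap`), the unit-variance spherical Gaussian clusters, the equal cluster sizes, and the definition of pairwise overlap as the sum of the two misclassification probabilities are all assumptions made for illustration. The sketch runs a plain Lloyd-style K-means and compares its misclassification rate with the assumed overlap as the cluster separation changes.

```python
import math
import random


def two_gaussians(n_per_cluster, separation, seed=0):
    """Sample two unit-variance 2-D Gaussian clusters (hypothetical setup)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, mean_x in enumerate((0.0, separation)):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(mean_x, 1.0), rng.gauss(0.0, 1.0)))
            labels.append(label)
    return points, labels


def kmeans(points, k, iters=100, seed=0):
    """Lloyd-style K-means: reduce the within-cluster sum of squared distances."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centers


def misclassification_rate(true_labels, found_labels):
    """Two-cluster case only: cluster labels are arbitrary, so use the better matching."""
    n = len(true_labels)
    mismatches = sum(t != f for t, f in zip(true_labels, found_labels))
    return min(mismatches, n - mismatches) / n


def theoretical_overlap(separation):
    """Assumed overlap for two equal-weight unit-variance Gaussians: 2 * Phi(-d / 2)."""
    return 1.0 + math.erf(-separation / (2.0 * math.sqrt(2.0)))


for d in (1.0, 2.0, 4.0):
    pts, truth = two_gaussians(200, d, seed=1)
    found, _ = kmeans(pts, 2, seed=1)
    rate = misclassification_rate(truth, found)
    print(f"separation={d}: overlap={theoretical_overlap(d):.3f}, "
          f"misclassification={rate:.3f}")
```

Under these assumptions, a smaller separation corresponds to a larger overlap. The printed rates depend on the random seed and should be read as a single illustrative run, not as results from the capstone project.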

Files

Original bundle
Name: Birzhan_Akynkozhayev_capstone.pdf
Size: 733.08 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 6.22 KB
Format: Item-specific license agreed upon to submission