Pairwise Overlap and Misclassification in Cluster Analysis

dc.contributor.author Akynkozhayev, Birzhan
dc.date.accessioned 2016-06-16T03:18:44Z
dc.date.available 2016-06-16T03:18:44Z
dc.date.issued 2015
dc.identifier.citation Akynkozhayev, Birzhan. 2015. Pairwise Overlap and Misclassification in Cluster Analysis. School of Science and Technology. Mathematics Department. http://nur.nu.edu.kz/handle/123456789/1635 ru_RU
dc.identifier.uri http://nur.nu.edu.kz/handle/123456789/1635
dc.description.abstract Separating data into distinct groups is one of the most important tools of learning and a means of obtaining valuable information from data. Cluster analysis studies ways of distributing objects into groups with similar characteristics. Real-world examples include age separation of a population, loyalty grouping of customers, and classification of living organisms into kingdoms. In particular, cluster analysis is an important objective of data mining, which studies ways of extracting key information from data and converting it into a more understandable form. There is no single best algorithm for producing data partitions in cluster analysis, but many perform well in various circumstances (Jain, 2008). Many popular clustering algorithms are based on an iterative partitioning method, in which single items are moved step by step from one cluster to another to optimize some parameter. One such algorithm, discussed in this paper, is the K-means algorithm, which partitions data points by minimizing the sum of squared distances within clusters (MacQueen, 1967). Another large class of algorithms is based on finite mixture model clustering methods. For example, the stochastic emEM clustering method, also covered in this paper, is based on maximum likelihood estimation of statistical model parameters (Melnykov & Maitra). Misclassification of data is not rare in cluster analysis. For instance, several points are misclassified in Figure 1, which compares the true partition (a) with the solution found by the K-means algorithm (b). Various factors lead to misclassification in clustering algorithms. The main goal of this paper is to analyze the effect of pairwise overlap, the number of dimensions of the data, and the number of clusters on misclassification.
The simplest case in which misclassification can occur is when there are two clusters. The overlap is exact in this case, so we used one of the simplest algorithms, K-means. For larger numbers of clusters, where the overlap is estimated, we considered the more complex emEM algorithm. ru_RU
dc.language.iso en ru_RU
dc.publisher Nazarbayev University School of Science and Technology
dc.rights Attribution-NonCommercial-ShareAlike 3.0 United States *
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/us/ *
dc.subject Research Subject Categories ru_RU
dc.subject Cluster Analysis ru_RU
dc.title Pairwise Overlap and Misclassification in Cluster Analysis ru_RU
dc.type Capstone Project ru_RU
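
The two-cluster K-means experiment described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the capstone project: the data are two spherical unit-variance Gaussian clusters in two dimensions, the distance between their centres stands in for overlap (it is not the pairwise overlap measure used in the project), and the function names `kmeans_two`, `misclassification_rate`, and `simulate` are hypothetical. The farthest-point initialisation is likewise a choice made here for reproducibility.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_two(points, iters=100, seed=0):
    """Lloyd-style K-means for k = 2 (minimises within-cluster squared distances).

    Initialisation is an assumption of this sketch: one random point plus
    the point farthest from it.
    """
    rng = random.Random(seed)
    first = rng.choice(points)
    second = max(points, key=lambda p: sq_dist(p, first))
    centers = [first, second]
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centers = []
        for j in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep centre of an empty cluster
        if new_centers == centers:
            break  # converged: centres no longer move
        centers = new_centers
    return labels


def misclassification_rate(true_labels, found_labels):
    """Share of misclassified points for two clusters.

    Cluster labels are arbitrary, so the better of the two possible
    label matchings is used.
    """
    n = len(true_labels)
    disagree = sum(t != f for t, f in zip(true_labels, found_labels))
    return min(disagree, n - disagree) / n


def simulate(separation, n_per_cluster=200, seed=1):
    """Draw two unit-variance Gaussian clusters and score K-means on them."""
    rng = random.Random(seed)
    points, truth = [], []
    for label, cx in enumerate((0.0, separation)):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(cx, 1.0), rng.gauss(0.0, 1.0)))
            truth.append(label)
    return misclassification_rate(truth, kmeans_two(points))


if __name__ == "__main__":
    # Smaller separation means more overlap between the two clusters.
    for d in (0.5, 1.0, 2.0, 4.0, 8.0):
        print(d, simulate(d))
```

Running the script prints the misclassification rate for several separations; under these assumptions the rate should shrink as the clusters move apart, which is the qualitative relationship between overlap and misclassification that the abstract sets out to study.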


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 United States.