Pairwise Overlap and Misclassification in Cluster Analysis

Akynkozhayev, Birzhan

Files

Birzhan_Akynkozhayev_capstone.pdf (733.08 KB)

Date

2015

Authors

Akynkozhayev, Birzhan

Publisher

Nazarbayev University School of Science and Technology

Abstract

Separating data into distinct groups is one of the most important tools for learning and for obtaining valuable information from data. Cluster analysis studies ways of distributing objects into groups with similar characteristics. Real-world examples include age separation of a population, loyalty grouping of customers, and classification of living organisms into kingdoms. In particular, cluster analysis is an important objective of data mining, which studies ways of extracting key information from data and converting it into a more understandable form. There is no single best algorithm for producing data partitions in cluster analysis, but there are many that perform well in various circumstances (Jain, 2008). Many popular clustering algorithms are based on an iterative partitioning method, in which single items are moved step by step from one cluster to another to optimize some criterion. One such algorithm, discussed in this paper, is the K-means algorithm, in which data points are partitioned by minimizing the sum of squared distances within clusters (MacQueen, 1967). Another large class of algorithms is based on finite mixture models. For example, the stochastic emEM clustering method, which is also covered in this paper, is based on maximum likelihood estimation of the parameters of a statistical model (Melnykov & Maitra). Misclassification of data is not rare in cluster analysis. For instance, several points are misclassified in Figure 1, which compares the true partition (a) with the solution found by the K-means algorithm (b). Various factors lead to misclassification in clustering algorithms. The main goal of this paper is to analyze the effect of pairwise overlap, the number of dimensions of the data, and the number of clusters on misclassification. The simplest case in which misclassification can occur is when there are two clusters. The overlap is exact in this case, so we used one of the simplest algorithms, K-means. For a higher number of clusters, where the overlap is estimated, we considered the more complex emEM algorithm.
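The two-cluster case described in the abstract can be illustrated with a minimal sketch. It is not the thesis's code: the function names, the Gaussian test data, and the use of the distance between cluster means as a rough stand-in for overlap are all assumptions made for illustration, and the thesis's own definition of pairwise overlap is not reproduced here.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd-style K-means: lowers the within-cluster sum of squared distances."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (an illustrative choice).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points; keep it if the cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def misclassification_rate_two_clusters(true_labels, found_labels):
    """Proportion of misclassified points for two clusters.

    Cluster labels are arbitrary, so the smaller of the two possible
    label matchings is reported.
    """
    err = np.mean(true_labels != found_labels)
    return min(err, 1.0 - err)

def simulate(separation, n_per_cluster=500, dim=2, seed=0):
    """Hypothetical experiment: two unit-variance Gaussian clusters whose
    means differ by `separation` along the first coordinate."""
    rng = np.random.default_rng(seed)
    shift = np.zeros(dim)
    shift[0] = separation
    X = np.vstack([
        rng.normal(size=(n_per_cluster, dim)),
        rng.normal(size=(n_per_cluster, dim)) + shift,
    ])
    y = np.repeat([0, 1], n_per_cluster)
    labels, _ = kmeans(X, k=2, seed=seed)
    return misclassification_rate_two_clusters(y, labels)

if __name__ == "__main__":
    for sep in (1.0, 2.0, 4.0, 8.0):
        print(f"separation={sep}: misclassification rate={simulate(sep):.3f}")
```

In this sketch, a smaller separation means the two clusters overlap more, and the reported misclassification rate should tend to rise accordingly; the exact values depend on the random seed and sample size.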

Keywords

Research Subject Categories, Cluster Analysis

Citation

Akynkozhayev, Birzhan. 2015. Pairwise Overlap and Misclassification in Cluster Analysis. School of Science and Technology. Mathematics Department. http://nur.nu.edu.kz/handle/123456789/1635

URI

http://nur.nu.edu.kz/handle/123456789/1635

Collections

03. Bachelor's Thesis

Creative Commons license

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 United States
