K-means over incomplete datasets using mean euclidean distance

Loai Abdallah, Ilan Shimshoni

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Missing values are common in real-world data. In this research we developed a new version of the well-known k-means clustering algorithm that deals with such incomplete datasets. The k-means algorithm performs two basic steps at each iteration: it associates each point with its closest centroid, and then it computes the new centroids. Running it therefore requires a distance function and a formula for computing the mean. To measure the similarity between two incomplete points, we use the distribution of the incomplete attributes. We propose several directions for computing the centroids. In the first, each incomplete point is treated as a single point and the centroid is computed according to a formula derived in this research. In the second and the third, each incomplete point is replaced with a large number of points drawn according to the data distribution, and the centroid is computed from these points. Even so, the runtime complexity of the suggested k-means is the same as that of standard k-means over complete datasets. We experimented on six standard numerical datasets from different fields and compared the performance of our proposed k-means to other basic methods. Our experiments show that our suggested k-means algorithms outperform previously published methods.
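The two steps described in the abstract can be illustrated with a minimal sketch. It is not the formulation derived in the paper, whose exact formulas the abstract does not give. It assumes that each missing attribute behaves like a random variable with the mean and variance of that attribute's observed values, so that the expected squared difference to a centroid coordinate c is (mean - c)^2 + variance. Under that assumption, the centroid coordinate that minimizes the expected squared distance is the average in which missing entries count as the attribute mean. The function names and the use of `None` for missing values are hypothetical choices made for this illustration.

```python
def attribute_stats(points):
    """Per-attribute (mean, variance) over the observed (non-None) values."""
    stats = []
    for j in range(len(points[0])):
        observed = [p[j] for p in points if p[j] is not None]
        mean = sum(observed) / len(observed)
        var = sum((v - mean) ** 2 for v in observed) / len(observed)
        stats.append((mean, var))
    return stats


def expected_sq_distance(point, centroid, stats):
    """Expected squared Euclidean distance from a possibly incomplete point
    to a complete centroid (assumption: a missing attribute follows the
    attribute's observed mean and variance)."""
    total = 0.0
    for x, c, (mean, var) in zip(point, centroid, stats):
        if x is None:
            total += (mean - c) ** 2 + var  # E[(X - c)^2] for a missing value
        else:
            total += (x - c) ** 2
    return total


def kmeans_incomplete(points, centroids, iterations=10):
    """Illustrative k-means loop over points that may contain None."""
    stats = attribute_stats(points)
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: closest centroid under the expected distance.
        labels = [
            min(range(len(centroids)),
                key=lambda k: expected_sq_distance(p, centroids[k], stats))
            for p in points
        ]
        # Update step: missing entries contribute the attribute mean.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if not members:
                continue  # keep the previous centroid for an empty cluster
            for j, (mean, _) in enumerate(stats):
                centroids[k][j] = sum(
                    mean if p[j] is None else p[j] for p in members
                ) / len(members)
    return labels, centroids
```

Each iteration of this sketch visits every point, centroid, and attribute once, so its per-iteration cost has the same order as standard k-means, in line with the complexity claim in the abstract.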

Original language: English
Title of host publication: Machine Learning and Data Mining in Pattern Recognition - 12th International Conference, MLDM 2016, Proceedings
Editors: Petra Perner
Publisher: Springer Verlag
Pages: 113-127
Number of pages: 15
ISBN (Print): 9783319419190
State: Published - 2016
Event: 12th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2016 - New York, United States
Duration: 16 Jul 2016 to 21 Jul 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9729
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 12th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2016
Country/Territory: United States
City: New York
Period: 16/07/16 to 21/07/16

Bibliographical note

Publisher Copyright:
© Springer International Publishing Switzerland 2016.

Keywords

  • Clustering
  • Incomplete datasets
  • K-means
  • Missing values

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
