Abstract
Missing values are common in real-world data. In this research we developed a new version of the well-known k-means clustering algorithm that handles such incomplete datasets. The k-means algorithm performs two basic steps at each iteration: it associates each point with its closest centroid, and then it computes the new centroids. Running it therefore requires a distance function and a formula for computing the mean. To measure the similarity between two incomplete points, we use the distribution of the incomplete attributes. We propose several approaches for computing the centroids. In the first, each incomplete point is treated as a single point and the centroid is computed with a formula derived in this research. In the second and third, each incomplete point is replaced with a large number of points according to the data distribution, and the centroid is computed from these points. Even so, the runtime complexity of the suggested k-means is the same as that of standard k-means over complete datasets. We experimented on six standard numerical datasets from different fields and compared the performance of our proposed k-means to other basic methods. Our experiments show that our suggested k-means algorithms outperform previously published methods.
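The two-step loop described in the abstract can be illustrated with a minimal sketch. It is not the formula derived in the paper, which the abstract does not state. As an assumption, a missing coordinate is treated as a random variable with the global mean and variance of its attribute, so its expected squared distance to a centroid coordinate c is (mu - c)^2 + sigma^2. Under that assumption the variance term is the same for every centroid, so assignments match mean imputation, and the centroid that minimizes the expected distance is the mean of the mean-filled members. The names `attribute_stats`, `expected_sq_distance` and `kmeans_incomplete` are hypothetical.

```python
def attribute_stats(points):
    """Per-attribute mean and variance over observed (non-None) values."""
    means, variances = [], []
    for j in range(len(points[0])):
        observed = [p[j] for p in points if p[j] is not None]
        mu = sum(observed) / len(observed)
        means.append(mu)
        variances.append(sum((v - mu) ** 2 for v in observed) / len(observed))
    return means, variances


def expected_sq_distance(point, centroid, means, variances):
    """Expected squared distance from a possibly incomplete point to a centroid.

    Assumption (not from the paper): a missing coordinate contributes
    (mu - c)^2 + sigma^2, using the attribute's global mean and variance.
    """
    total = 0.0
    for x, c, mu, var in zip(point, centroid, means, variances):
        total += (mu - c) ** 2 + var if x is None else (x - c) ** 2
    return total


def kmeans_incomplete(points, initial_centroids, iterations=20):
    """k-means over points whose missing values are encoded as None."""
    means, variances = attribute_stats(points)
    # Mean-filled copies; under the assumed distance these give the optimal centroid.
    filled = [[mu if x is None else x for x, mu in zip(p, means)] for p in points]
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 1: associate each point with its closest centroid.
        labels = [
            min(range(k), key=lambda c: expected_sq_distance(p, centroids[c], means, variances))
            for p in points
        ]
        # Step 2: recompute each centroid from its members (empty clusters are left unchanged).
        for c in range(k):
            members = [filled[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: two separated groups, each with one incomplete point.
points = [[0.0, 0.1], [0.2, None], [0.1, 0.0],
          [10.0, 10.1], [None, 9.9], [10.2, 10.0]]
labels, centroids = kmeans_incomplete(points, [[0.0, 0.0], [10.0, 10.0]])
```

Each iteration costs one pass over the points per centroid, the same order as standard k-means, which is consistent with the complexity claim in the abstract. The sampling-based variants mentioned there are not sketched here.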
Original language | English |
---|---|
Title of host publication | Machine Learning and Data Mining in Pattern Recognition - 12th International Conference, MLDM 2016, Proceedings |
Editors | Petra Perner |
Publisher | Springer Verlag |
Pages | 113-127 |
Number of pages | 15 |
ISBN (Print) | 9783319419190 |
State | Published - 2016 |
Event | 12th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2016 - New York, United States. Duration: 16 Jul 2016 → 21 Jul 2016 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 9729 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 12th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2016 |
---|---|
Country/Territory | United States |
City | New York |
Period | 16/07/16 → 21/07/16 |
Bibliographical note
Publisher Copyright: © Springer International Publishing Switzerland 2016.
Keywords
- Clustering
- Incomplete datasets
- K-means
- Missing values
ASJC Scopus subject areas
- Theoretical Computer Science
- General Computer Science