K-means is a widespread clustering algorithm characterized by its simplicity and efficiency. K-means behavior, though, strongly depends on the initialization of the cluster centres (centroids) and tends to be stuck in a local sub-optimal solution. Many techniques have been devised to overcome these problems, e.g., by a global strategy to reduce locality of centroid adjustments or by using density peaks for centroids initialization. This paper proposes an im-proved version of the K-means -DPKM- based on the concepts of density peaks. Density peaks have been proved to be a key for solving clustering prob-lems where not-spherical regions with complex point distributions are involved. Centroids are actually selected from density peaks by using a technique bor-rowed from the DK-means++ initialization method, which ensures centroids are not only points with higher density, but also far-away from each other. DPKM is implemented in Java using parallel streams and lambda expressions which are capable of delivering good execution times on large datasets on multi-core ma-chines with shared memory. The efficiency and reliability of DPKM are demonstrated by applying it to challenging synthetic datasets often used as benchmarks for clustering methods.

Fast and Accurate K-means Clustering based on Density Peaks

Nigro Libero
;
Cicirelli Franco
2022-01-01

Abstract

K-means is a widespread clustering algorithm characterized by its simplicity and efficiency. K-means behavior, though, strongly depends on the initialization of the cluster centres (centroids) and tends to be stuck in a local sub-optimal solution. Many techniques have been devised to overcome these problems, e.g., by a global strategy to reduce locality of centroid adjustments or by using density peaks for centroids initialization. This paper proposes an im-proved version of the K-means -DPKM- based on the concepts of density peaks. Density peaks have been proved to be a key for solving clustering prob-lems where not-spherical regions with complex point distributions are involved. Centroids are actually selected from density peaks by using a technique bor-rowed from the DK-means++ initialization method, which ensures centroids are not only points with higher density, but also far-away from each other. DPKM is implemented in Java using parallel streams and lambda expressions which are capable of delivering good execution times on large datasets on multi-core ma-chines with shared memory. The efficiency and reliability of DPKM are demonstrated by applying it to challenging synthetic datasets often used as benchmarks for clustering methods.
2022
Clustering, K-means, Centroids Initialization, Density Peaks, DK-means++, Java, Parallel Streams, Lambda Expressions, Multi-core Machines
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.11770/341397
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact