K-means is a widespread clustering algorithm characterized by its simplicity and efficiency. K-means behavior, though, strongly depends on the initialization of the cluster centres (centroids) and tends to be stuck in a local sub-optimal solution. Many techniques have been devised to overcome these problems, e.g., by a global strategy to reduce locality of centroid adjustments or by using density peaks for centroids initialization. This paper proposes an im-proved version of the K-means -DPKM- based on the concepts of density peaks. Density peaks have been proved to be a key for solving clustering prob-lems where not-spherical regions with complex point distributions are involved. Centroids are actually selected from density peaks by using a technique bor-rowed from the DK-means++ initialization method, which ensures centroids are not only points with higher density, but also far-away from each other. DPKM is implemented in Java using parallel streams and lambda expressions which are capable of delivering good execution times on large datasets on multi-core ma-chines with shared memory. The efficiency and reliability of DPKM are demonstrated by applying it to challenging synthetic datasets often used as benchmarks for clustering methods.
Fast and Accurate K-means Clustering based on Density Peaks
Nigro Libero
;Cicirelli Franco
2022-01-01
Abstract
K-means is a widespread clustering algorithm characterized by its simplicity and efficiency. K-means behavior, though, strongly depends on the initialization of the cluster centres (centroids) and tends to be stuck in a local sub-optimal solution. Many techniques have been devised to overcome these problems, e.g., by a global strategy to reduce locality of centroid adjustments or by using density peaks for centroids initialization. This paper proposes an im-proved version of the K-means -DPKM- based on the concepts of density peaks. Density peaks have been proved to be a key for solving clustering prob-lems where not-spherical regions with complex point distributions are involved. Centroids are actually selected from density peaks by using a technique bor-rowed from the DK-means++ initialization method, which ensures centroids are not only points with higher density, but also far-away from each other. DPKM is implemented in Java using parallel streams and lambda expressions which are capable of delivering good execution times on large datasets on multi-core ma-chines with shared memory. The efficiency and reliability of DPKM are demonstrated by applying it to challenging synthetic datasets often used as benchmarks for clustering methods.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.