Performance of Parallel K-Means based on Theatre
Cicirelli Franco;Nigro Libero;Pupo Francesco
2022-01-01
Abstract
K-Means is a well-known clustering algorithm that partitions a set of data points into groups (clusters) so as to minimize the dissimilarity, measured by some metric, among points within the same group. Owing to its simplicity, K-Means is widely used in unsupervised machine learning clustering applications. However, its execution time can easily become a bottleneck on very large datasets with a large number of clusters, such as those encountered in many big data ecosystems. Many efforts to parallelize K-Means have therefore been reported in the literature, on both shared-nothing and shared-memory architectures. This paper proposes a novel approach to parallel K-Means on multi/many-core machines, based on the Theatre actor system developed in Java. The realization relies on message passing for synchronization among actors (workers), but also allows data to be shared, in a controlled and safe way, among the actors of the same computing node (theatre). The approach proves effective in delivering high-performance execution. The paper first provides background on the basic K-Means algorithm and the Theatre architecture, then describes an actor-based parallel version of K-Means and evaluates it experimentally.
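The general idea of data-parallel K-Means summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's Theatre-based implementation: the abstract does not show the Theatre API, so this sketch assumes a plain `java.util.concurrent` thread pool instead of actors, squared Euclidean distance as the metric, and hypothetical names (`ParallelKMeansSketch`, `Partial`, `fit`). Each worker assigns its own slice of the points to the nearest centroid and returns partial sums, which are then merged to compute the new centroids. The centroid array is shared read-only within an iteration, loosely mirroring the controlled data sharing mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelKMeansSketch {

    /** Partial result of one worker: per-cluster coordinate sums and point counts. */
    static final class Partial {
        final double[][] sums;
        final int[] counts;
        Partial(int k, int dim) {
            sums = new double[k][dim];
            counts = new int[k];
        }
    }

    /** Index of the centroid closest to p (squared Euclidean distance, an assumption). */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    /** Runs K-Means until the centroids stop changing or maxIterations is reached. */
    static double[][] fit(double[][] points, double[][] initial, int workers, int maxIterations) {
        int k = initial.length, dim = initial[0].length, n = points.length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = initial[c].clone();

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            for (int iter = 0; iter < maxIterations; iter++) {
                final double[][] current = centroids; // read-only snapshot shared by all workers
                List<Future<Partial>> futures = new ArrayList<>();
                for (int w = 0; w < workers; w++) {
                    // Each worker processes the contiguous slice [from, to) of the points.
                    final int from = (int) ((long) n * w / workers);
                    final int to = (int) ((long) n * (w + 1) / workers);
                    futures.add(pool.submit(() -> {
                        Partial part = new Partial(k, dim);
                        for (int i = from; i < to; i++) {
                            int c = nearest(points[i], current);
                            part.counts[c]++;
                            for (int j = 0; j < dim; j++) part.sums[c][j] += points[i][j];
                        }
                        return part;
                    }));
                }
                // Synchronization point: wait for every worker and merge the partial results.
                double[][] sums = new double[k][dim];
                int[] counts = new int[k];
                for (Future<Partial> f : futures) {
                    Partial part = f.get();
                    for (int c = 0; c < k; c++) {
                        counts[c] += part.counts[c];
                        for (int j = 0; j < dim; j++) sums[c][j] += part.sums[c][j];
                    }
                }
                // New centroid = mean of assigned points; an empty cluster keeps its centroid.
                double[][] next = new double[k][dim];
                boolean changed = false;
                for (int c = 0; c < k; c++) {
                    for (int j = 0; j < dim; j++) {
                        next[c][j] = counts[c] == 0 ? current[c][j] : sums[c][j] / counts[c];
                        if (next[c][j] != current[c][j]) changed = true;
                    }
                }
                centroids = next;
                if (!changed) break;
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
        return centroids;
    }
}
```

In this sketch the merge step plays the role that message exchange between workers and a coordinator would play in an actor-based design. Only the small per-cluster sums cross the worker boundary, while the large point array is never copied.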