SjClust: Towards a framework for integrating similarity join algorithms and clustering

A critical task in data cleaning and integration is the identification of duplicate records representing the same real-world entity. A popular approach to duplicate identification employs a similarity join to find pairs of similar records, followed by a clustering algorithm that groups together records referring to the same entity. However, the clustering algorithm is used strictly as a post-processing step, which slows down overall performance and produces results only at the end of the whole process. In this paper, we propose SjClust, a framework that integrates similarity join and clustering into a single operation. Our approach smoothly accommodates a variety of cluster representation and merging strategies within set similarity join algorithms, while fully leveraging state-of-the-art optimization techniques.
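To make the conventional two-step pipeline described in the abstract concrete, here is a minimal Python sketch of the baseline that SjClust aims to improve on; it is not SjClust itself. Several details are assumptions made only for illustration: records are tokenized on whitespace, similarity is Jaccard over token sets, the join is a naive all-pairs comparison, and clustering groups records by connected components using union-find. The function names, sample records, and threshold are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def similarity_join(records, threshold):
    """Naive self-join: index pairs whose Jaccard similarity >= threshold."""
    token_sets = [set(r.lower().split()) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]


def cluster_pairs(n, pairs):
    """Post-processing step: connected components over the joined pairs."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical sample data; clustering can only start once the join has finished.
records = [
    "john a smith new york",
    "john smith new york",
    "jon smith new york",
    "mary jones boston",
]
pairs = similarity_join(records, threshold=0.55)
clusters = cluster_pairs(len(records), pairs)
print(pairs)
print(clusters)
```

In this sketch the clusters become available only after every pair has been produced, which is the limitation the abstract attributes to treating clustering as a post-processing step.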

Ribeiro, Leonardo Andrade; Cuzzocrea, Alfredo Massimiliano; Bezerra, Karen Aline Alves; Do Nascimento, Ben Hur Bahia

2016-01-01

Year: 2016
ISBN: 9789897581878
Keywords: Clustering; Data cleaning; Data integration; Duplicate identification; Set similarity joins; Information Systems and Management; Computer Science Applications; Computer Vision and Pattern Recognition
Publication type: 4.1 Contribution in conference proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/312812
