An incremental clustering scheme for duplicate detection in large databases

IRIS

We propose an incremental algorithm for clustering duplicate tuples in large databases, which allows to assign any new tuple t to the cluster containing the database tuples which are most similar to t (and hence are likely to refer to the same real-world entity t is associated with). The core of the approach is a hash-based indexing technique that tends to assign highly similar objects to the same buckets. Empirical evaluation proves that the proposed method allows to gain considerable efficiency improvement over a state-of-art index structure for proximity searches in metric spaces.

An incremental clustering scheme for duplicate detection in large databases

Cesario E.;Folino F.;Manco G.;Pontieri L.

2005-01-01

Abstract

We propose an incremental algorithm for clustering duplicate tuples in large databases, which allows to assign any new tuple t to the cluster containing the database tuples which are most similar to t (and hence are likely to refer to the same real-world entity t is associated with). The core of the approach is a hash-based indexing technique that tends to assign highly similar objects to the same buckets. Empirical evaluation proves that the proposed method allows to gain considerable efficiency improvement over a state-of-art index structure for proximity searches in metric spaces.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
			2005
		
	Codice ISBN
	
			0-7695-2404-4
		
	Appare nelle tipologie:
	
			4.1 Contributo in Atti di convegno

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.11770/303525

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

ND

8

7

social impact