In this paper we present a novel approximate algorithm to calculate the top k closest pairs join query of two large and high dimensional data sets. The algorithm has worst case time complexity 0(d^2 nk) and space complexity 0(nd) and guarantees a solution within a 0(d^{1 + \frac{1}{\tau }}) factor of the exact one, where \tau\in \{ 1,2, \ldots ,\infty \} denotes the Minkowski metrics L_\tauof interest and d the dimensionality. It makes use of the concept of space filling curve to establish an order between the points of the space and performs at most d + 1 sorts and scans of the two data sets. During a scan, each point from one data set is compared with its closest points, according to the space filling curve order, in the other data set and points whose contribution to the solution has already been analyzed are detected and eliminated. Experimental results on real and synthetic data sets show that our algorithm (i) behaves as an exact algorithm in low dimensional spaces; (ii) it is able to prune the entire (or a considerable fraction of the) data set even for high dimensions if certain separation conditions are satisfied; (iii) in any case it returns a solution within a small error to the exact one.

Top-k Closest Pairs Join Query: An Approximate Algorithm for Large High Dimensional Data

ANGIULLI, Fabrizio;
2004-01-01

Abstract

In this paper we present a novel approximate algorithm to calculate the top k closest pairs join query of two large and high dimensional data sets. The algorithm has worst case time complexity 0(d^2 nk) and space complexity 0(nd) and guarantees a solution within a 0(d^{1 + \frac{1}{\tau }}) factor of the exact one, where \tau\in \{ 1,2, \ldots ,\infty \} denotes the Minkowski metrics L_\tauof interest and d the dimensionality. It makes use of the concept of space filling curve to establish an order between the points of the space and performs at most d + 1 sorts and scans of the two data sets. During a scan, each point from one data set is compared with its closest points, according to the space filling curve order, in the other data set and points whose contribution to the solution has already been analyzed are detected and eliminated. Experimental results on real and synthetic data sets show that our algorithm (i) behaves as an exact algorithm in low dimensional spaces; (ii) it is able to prune the entire (or a considerable fraction of the) data set even for high dimensions if certain separation conditions are satisfied; (iii) in any case it returns a solution within a small error to the exact one.
2004
0-7695-2168-1
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.11770/169004
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? 1
social impact