A probabilistic framework for estimating the accuracy of aggregate range queries evaluated over histograms

IRIS

A histogram over a multi-dimensional data set is a synopsis consisting of aggregate data summarizing the values of the points inside non-overlapping ranges of the domain. Owing to their effectiveness in supporting a fast (though approximate) estimation of the answers of aggregate range queries, histograms are widely used in several contexts dealing with multi-dimensional data, especially those where the precision of the answers (within reasonable limits) is not the major requirement. However, the practical impact of histograms has been limited by the fact that, so far, no mechanism has been defined which provides a reliable (non-trivial) quantification of the degree of approximation of the query estimates. In this paper, this problem is addressed by introducing a probabilistic framework which allows for estimating the accuracy of the approximate answers resulting from evaluating aggregate queries over a histogram. Specifically, given a histogram over a data set, the answer of an aggregate range query is modeled as a random variable, whose probability distribution depends on the type and the values of the aggregate data stored in the histogram. Therein, the mean value and the variance of this random variable represent an estimate of the actual answer of the corresponding query and of the error rate, respectively. The proposed framework can exploit different kinds of aggregates (namely, sum and count) stored in the histogram, as well as integrity constraints defined over the original data.

A probabilistic framework for estimating the accuracy of aggregate range queries evaluated over histograms

BUCCAFURRI F;FURFARO, Filippo;SACCA', Domenico

2012-01-01

Abstract

A histogram over a multi-dimensional data set is a synopsis consisting of aggregate data summarizing the values of the points inside non-overlapping ranges of the domain. Owing to their effectiveness in supporting a fast (though approximate) estimation of the answers of aggregate range queries, histograms are widely used in several contexts dealing with multi-dimensional data, especially those where the precision of the answers (within reasonable limits) is not the major requirement. However, the practical impact of histograms has been limited by the fact that, so far, no mechanism has been defined which provides a reliable (non-trivial) quantification of the degree of approximation of the query estimates. In this paper, this problem is addressed by introducing a probabilistic framework which allows for estimating the accuracy of the approximate answers resulting from evaluating aggregate queries over a histogram. Specifically, given a histogram over a data set, the answer of an aggregate range query is modeled as a random variable, whose probability distribution depends on the type and the values of the aggregate data stored in the histogram. Therein, the mean value and the variance of this random variable represent an estimate of the actual answer of the corresponding query and of the error rate, respectively. The proposed framework can exploit different kinds of aggregates (namely, sum and count) stored in the histogram, as well as integrity constraints defined over the original data.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2012
			
	Parole chiave
	
				Multidimensional data; data compression; range queries
			
	Appare nelle tipologie:
	
				1.1 Articolo in rivista

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.11770/156587

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

ND

2

1

social impact