A histogram over a multi-dimensional data set is a synopsis consisting of aggregate data summarizing the values of the points inside non-overlapping ranges of the domain. Owing to their effectiveness in supporting a fast (though approximate) estimation of the answers of aggregate range queries, histograms are widely used in several contexts dealing with multi-dimensional data, especially those where the precision of the answers (within reasonable limits) is not the major requirement. However, the practical impact of histograms has been limited by the fact that, so far, no mechanism has been defined which provides a reliable (non-trivial) quantification of the degree of approximation of the query estimates. In this paper, this problem is addressed by introducing a probabilistic framework which allows for estimating the accuracy of the approximate answers resulting from evaluating aggregate queries over a histogram. Specifically, given a histogram over a data set, the answer of an aggregate range query is modeled as a random variable, whose probability distribution depends on the type and the values of the aggregate data stored in the histogram. Therein, the mean value and the variance of this random variable represent an estimate of the actual answer of the corresponding query and of the error rate, respectively. The proposed framework can exploit different kinds of aggregates (namely, sum and count) stored in the histogram, as well as integrity constraints defined over the original data.
A probabilistic framework for estimating the accuracy of aggregate range queries evaluated over histograms
FURFARO, Filippo;SACCA', Domenico
2012-01-01
Abstract
A histogram over a multi-dimensional data set is a synopsis consisting of aggregate data summarizing the values of the points inside non-overlapping ranges of the domain. Owing to their effectiveness in supporting a fast (though approximate) estimation of the answers of aggregate range queries, histograms are widely used in several contexts dealing with multi-dimensional data, especially those where the precision of the answers (within reasonable limits) is not the major requirement. However, the practical impact of histograms has been limited by the fact that, so far, no mechanism has been defined which provides a reliable (non-trivial) quantification of the degree of approximation of the query estimates. In this paper, this problem is addressed by introducing a probabilistic framework which allows for estimating the accuracy of the approximate answers resulting from evaluating aggregate queries over a histogram. Specifically, given a histogram over a data set, the answer of an aggregate range query is modeled as a random variable, whose probability distribution depends on the type and the values of the aggregate data stored in the histogram. Therein, the mean value and the variance of this random variable represent an estimate of the actual answer of the corresponding query and of the error rate, respectively. The proposed framework can exploit different kinds of aggregates (namely, sum and count) stored in the histogram, as well as integrity constraints defined over the original data.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.