Hierarchical binary partitions of multi-dimensional data are investigated as a basis for the construction of effective histograms. Specifically, the impact of adopting lossless compression techniques for representing the histogram on both the accuracy and the efficiency of query answering is investigated. Compression is obtained by exploiting the hierarchical partition scheme underlying the histogram, and then introducing further restrictions on the partitioning which enable a more compact representation of bucket boundaries. Basically, these restrictions consist of constraining the splits of the partition to be laid onto regular grids defined on the buckets. Several heuristics guiding the histogram construction are also proposed, and a thorough experimental analysis comparing the accuracy of histograms resulting from combining different heuristics with different representation models (both the new compression-based and the traditional ones) is provided. The best accuracy turns out from combining our grid-constrained partitioning scheme with one of the new heuristics. Histograms resulting from this combination are compared with state-of-the-art summarization techniques, showing that the proposed approach yields lower error rates and is much less sensitive to dimensionality, and that adopting our compression scheme results in improving the efficiency of query estimation.
Compressed hierarchical binary histograms for summarizing multi-dimensional data
FURFARO, Filippo;SACCA', Domenico;
2008-01-01
Abstract
Hierarchical binary partitions of multi-dimensional data are investigated as a basis for the construction of effective histograms. Specifically, the impact of adopting lossless compression techniques for representing the histogram on both the accuracy and the efficiency of query answering is investigated. Compression is obtained by exploiting the hierarchical partition scheme underlying the histogram, and then introducing further restrictions on the partitioning which enable a more compact representation of bucket boundaries. Basically, these restrictions consist of constraining the splits of the partition to be laid onto regular grids defined on the buckets. Several heuristics guiding the histogram construction are also proposed, and a thorough experimental analysis comparing the accuracy of histograms resulting from combining different heuristics with different representation models (both the new compression-based and the traditional ones) is provided. The best accuracy turns out from combining our grid-constrained partitioning scheme with one of the new heuristics. Histograms resulting from this combination are compared with state-of-the-art summarization techniques, showing that the proposed approach yields lower error rates and is much less sensitive to dimensionality, and that adopting our compression scheme results in improving the efficiency of query estimation.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.