Range (Rényi) Entropy Queries and Partitioning
Aryan Esmailpour, Sanjay Krishnan, Stavros Sintos
TL;DR
This work studies how to compute the Shannon and Rényi entropies of subsets of weighted, colored points in fixed dimension, where each subset is selected by a query rectangle. For the range S-entropy and R-entropy query problems, the authors prove conditional lower bounds that rule out data structures with near-linear space and near-constant query time, and they give exact data structures for 1D and higher dimensions with o(n^{2d}) space and o(n) query time. They also develop near-linear-space structures that return additive or multiplicative approximations of both entropies, including specialized 1D and higher-dimensional constructions. They further show how these entropy-query structures support partitioning and histogram applications, and discuss connections to range-colored queries and to existing streaming/dual-access entropy results. The resulting toolkit targets entropy-based data analysis over geometric query ranges, with applications in compression, histograms, and data cleaning.
Abstract
Data partitioning that maximizes or minimizes the Shannon entropy or, more generally, the Rényi entropy is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partitioning algorithms can be accelerated if we have a data structure that computes the entropy of different subsets of the data whenever the algorithm needs to decide which block to construct. Such a data structure is also useful for data analysts exploring different subsets of the data to identify areas of interest. While it is generally known how to compute the Shannon or the Rényi entropy of a discrete distribution efficiently in the offline or streaming setting, we focus on the query setting, where we aim to efficiently derive the entropy of the subset of the data that satisfies some linear predicates. We solve this problem in a setting typical of real data, where data items are geometric points and each requested area is a query (hyper)rectangle. More specifically, we consider a set $P$ of $n$ weighted and colored points in $\mathbb{R}^d$, where $d$ is a constant. For the range S-entropy (resp. R-entropy) query problem, the goal is to construct a low-space data structure such that, given a query (hyper)rectangle $R$, it computes in sublinear time the Shannon (resp. Rényi) entropy based on the colors and the weights of the points in $P\cap R$. We show conditional lower bounds proving that we cannot hope for data structures with near-linear space and near-constant query time for either the range S-entropy or the range R-entropy query problem. Then, we propose exact data structures for $d=1$ and $d>1$ with $o(n^{2d})$ space and $o(n)$ query time for both problems. Finally, we propose near-linear-space data structures that return either an additive or a multiplicative approximation of the Shannon (resp. Rényi) entropy in $P\cap R$.
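To make the query problem concrete, the following is a minimal brute-force sketch of what a range S-entropy or R-entropy query returns. It is not one of the paper's data structures: it scans all $n$ points per query, which is exactly the linear cost the proposed structures aim to avoid. The function names, the point representation, and the choice of base-2 logarithms are illustrative assumptions. The sketch also assumes that the probability of a color is its total weight in $P\cap R$ divided by the total weight of $P\cap R$, which the abstract does not spell out.

```python
import math
from collections import defaultdict

def color_distribution(points, rect):
    """Return the color distribution of the points that fall inside rect.

    points: list of (coords, color, weight) tuples (hypothetical layout).
    rect:   list of (lo, hi) closed intervals, one per dimension.
    Assumption: a color's probability is its weight share inside the rectangle.
    """
    mass = defaultdict(float)
    for coords, color, weight in points:
        if all(lo <= x <= hi for x, (lo, hi) in zip(coords, rect)):
            mass[color] += weight
    total = sum(mass.values())
    if total == 0:
        return {}  # no weight inside the rectangle
    return {color: m / total for color, m in mass.items()}

def shannon_entropy(dist):
    """Shannon entropy in bits; an empty distribution has entropy 0."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def renyi_entropy(dist, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1) in bits."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    if not dist:
        return 0.0
    return math.log2(sum(p ** alpha for p in dist.values())) / (1 - alpha)

# Usage: two colors of equal weight inside the rectangle, one point outside.
points = [((1.0, 1.0), "red", 1.0), ((2.0, 2.0), "blue", 1.0), ((9.0, 9.0), "green", 5.0)]
rect = [(0.0, 5.0), (0.0, 5.0)]
dist = color_distribution(points, rect)
print(shannon_entropy(dist), renyi_entropy(dist, 2.0))
```

A baseline of this kind is mainly useful as a correctness reference when testing an exact or approximate range-entropy structure on small inputs.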
