Range (Rényi) Entropy Queries and Partitioning

Aryan Esmailpour; Sanjay Krishnan; Stavros Sintos

TL;DR

This work studies how to compute the Shannon and Rényi entropies of subsets of weighted, colored points in fixed dimension, where each subset is selected by a query (hyper)rectangle. It proves conditional lower bounds showing that range S-entropy and R-entropy queries cannot be answered with near-linear space and near-constant query time, and gives exact data structures for $d=1$ and $d>1$ with $o(n^{2d})$ space and $o(n)$ query time. The authors also develop near-linear-space structures that return additive or multiplicative approximations of either entropy, including specialized 1D and higher-dimensional constructions. They show how these structures support partitioning and histogram construction, and discuss connections to range-colored queries and to existing streaming and dual-access entropy results. Target applications include compression, histograms, and data cleaning.

Abstract

Data partitioning that maximizes or minimizes the Shannon entropy, or more generally the Rényi entropy, is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partitioning algorithms can be accelerated by a data structure that computes the entropy of different subsets of the data whenever the algorithm must decide which block to construct. Such a data structure is also useful for data analysts who explore different subsets of the data to identify areas of interest. While it is generally known how to compute the Shannon or the Rényi entropy of a discrete distribution efficiently in the offline or streaming setting, we focus on the query setting, where we aim to efficiently derive the entropy of the subset of the data that satisfies some linear predicates. We solve this problem in a setting that is typical for real data, where data items are geometric points and each requested area is a query (hyper)rectangle. More specifically, we consider a set $P$ of $n$ weighted and colored points in $\mathbb{R}^d$, where $d$ is a constant. For the range S-entropy (resp. R-entropy) query problem, the goal is to construct a low-space data structure such that, given a query (hyper)rectangle $R$, it computes the Shannon (resp. Rényi) entropy based on the colors and the weights of the points in $P\cap R$ in sublinear time. We show conditional lower bounds proving that we cannot hope for data structures with near-linear space and near-constant query time for either the range S-entropy or the range R-entropy query problem. Then, we propose exact data structures for $d=1$ and $d>1$ with $o(n^{2d})$ space and $o(n)$ query time for both problems. Finally, we propose near-linear-space data structures that return either an additive or a multiplicative approximation of the Shannon (resp. Rényi) entropy in $P\cap R$.
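To make the query problem concrete, the following is a minimal brute-force baseline, not one of the paper's data structures. It scans all points for each query, so it takes linear time per query, which is exactly the cost the proposed structures aim to beat. The function name `range_entropies`, the tuple layout `(coords, color, weight)`, the closed rectangle boundaries, and the use of base-2 logarithms are all assumptions made for illustration. The probability of a color is taken to be its share of the total weight inside the rectangle, which agrees with the unit-weight example in Figure 1.

```python
import math
from collections import defaultdict


def range_entropies(points, rect, alpha=2.0):
    """Brute-force Shannon and Renyi entropy of the points inside `rect`.

    points: iterable of (coords, color, weight) tuples (hypothetical layout).
    rect:   list of (lo, hi) pairs, one per dimension; bounds are inclusive.
    alpha:  Renyi order, assumed positive and different from 1.
    Returns (shannon, renyi) in bits; (0.0, 0.0) if the range is empty.
    """
    # Accumulate the total weight of each color among points in the rectangle.
    mass = defaultdict(float)
    for coords, color, weight in points:
        if all(lo <= x <= hi for x, (lo, hi) in zip(coords, rect)):
            mass[color] += weight

    total = sum(mass.values())
    if total == 0:
        return 0.0, 0.0

    # Color distribution induced by the weights in the range.
    probs = [m / total for m in mass.values()]
    shannon = sum(p * math.log2(1.0 / p) for p in probs)
    renyi = math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return shannon, renyi


# Two colors with equal weight fall inside the rectangle; the green point is outside.
points = [
    ((0.0, 0.0), "red", 1.0),
    ((1.0, 1.0), "blue", 1.0),
    ((5.0, 5.0), "green", 3.0),
]
shannon, renyi = range_entropies(points, [(-1.0, 2.0), (-1.0, 2.0)])
```

A uniform distribution over two colors has one bit of entropy under both measures, which makes this small instance a convenient sanity check.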

Paper Structure (65 sections, 33 theorems, 51 equations, 5 figures, 2 tables)

Introduction
Useful notation.
Summary of Results.
Comparison with the conference version.
Related work.
Preliminaries
Updating the Shannon entropy.
Updating the Rényi entropy.
Range queries.
Range tree and sampling.
Range trees for S-entropy and R-entropy queries in $\tilde{O}(m)$ query time.
Expected Shannon entropy and monotonicity.
Lower Bounds
Preprocessing-query tradeoff
Extension to range R-entropy query.
...and 50 more sections

Key Result

Lemma 3.1

In the preceding reduction, $c_{i,j} = 0$ if and only if $H_{i,j} = H'_{i,j}$.

Figures (5)

Figure 1: A set $P$ of $20$ points in $\mathbb{R}^2$. For simplicity, assume that the weight of every point is $1$, i.e., $w(p)=1$ for every $p\in P$. Each point is associated with one color (or category) red, green, blue, or purple. There are three different colors among the points in $P\cap R$, namely red, green, and blue. The distribution $\mathcal{D}_R$ is defined over $3$ outcomes: red, green, and blue. The probability of red is $\mathcal{D}_R(\mathsf{red})=\frac{2}{9}$ because there are $2$ red points and $9$ total points in $P\cap R$. Similarly, the probability of green is $\mathcal{D}_R(\mathsf{green})=\frac{3}{9}$ and the probability of blue is $\mathcal{D}_R(\mathsf{blue})=\frac{4}{9}$. We have $H(P\cap R)=H(\mathcal{D}_R)=\frac{2}{9}\log\frac{9}{2}+\frac{3}{9}\log\frac{9}{3}+\frac{4}{9}\log\frac{9}{4}\approx 1.53$ and $H_2(P\cap R)=H_2(\mathcal{D}_R)=-\log\left((2/9)^2+(3/9)^2+(4/9)^2\right)\approx 1.48$.
Figure 2: An example of constructing the point set $P$ (right) from two sample $3 \times 3$ matrices $A$ and $B$ (left). The colors red, green, and blue represent colors $1$, $2$, and $3$, respectively. Points $1$ to $9$ are the points in $A_1A_2A_3$ corresponding to the rows of $A$, and points $10$ to $18$ are the points in $B_1B_2B_3$ corresponding to the columns of $B$. Blocks are separated by vertical dashed lines. The interval $\rho_{2, 2}$, which is used to find the entry $c_{2,2}$, contains the points $6$ to $14$ as shown.
Figure 3: Lower bound construction.
Figure 4: Instance of the query algorithm given query interval $R$. Purple points are points in $P_R$.
Figure 5: Partition $P$ into $K$ buckets in $\mathbb{R}^2$. Two consecutive buckets have at most one color in common.
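The arithmetic in the Figure 1 caption can be checked with a short script. This is a sketch under two assumptions: logarithms are base 2 (the choice that reproduces the caption's values of about $1.53$ and $1.48$), and the helper names `shannon_entropy` and `renyi_entropy` are illustrative rather than taken from the paper. The last line also evaluates the Rényi entropy at an order close to $1$, where it is known to approach the Shannon entropy.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (bits) of the distribution induced by color counts."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values() if c > 0)


def renyi_entropy(counts, alpha):
    """Renyi entropy (bits) of order alpha, for alpha > 0 and alpha != 1."""
    total = sum(counts.values())
    power_sum = sum((c / total) ** alpha for c in counts.values() if c > 0)
    return math.log2(power_sum) / (1.0 - alpha)


# Color counts of the 9 unit-weight points inside R in Figure 1.
counts_in_R = {"red": 2, "green": 3, "blue": 4}

h_shannon = shannon_entropy(counts_in_R)         # H(P ∩ R)
h_renyi_2 = renyi_entropy(counts_in_R, 2.0)      # H_2(P ∩ R)
h_renyi_near_1 = renyi_entropy(counts_in_R, 1.0001)  # close to the Shannon value
```

For order $2$ the general formula reduces to $-\log\left(\sum_i p_i^2\right)$, which is the expression used in the caption.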

Theorems & Definitions (57)

Example 1.1: Columnar Compression
Example 1.2: Histogram Construction
Example 1.3: Data Cleaning
Example 1.4: Diversity index
Example 1.5: Network-Traffic Anomaly Detection
Definition 1.6: Range S-entropy query problem
Definition 1.7: Range R-entropy query problem
Lemma 3.1
Lemma 3.1
proof
...and 47 more
