Range (Rényi) Entropy Queries and Partitioning

Aryan Esmailpour, Sanjay Krishnan, Stavros Sintos

TL;DR

This work studies how to compute the Shannon and Rényi entropies of subsets of weighted, colored points in $\mathbb{R}^d$ for constant $d$, where each subset is defined by a query (hyper)rectangle. It proves conditional lower bounds that rule out data structures with near-linear space and near-constant query time for both range S-entropy and R-entropy queries, and gives exact data structures for $d=1$ and $d>1$ with $o(n^{2d})$ space and $o(n)$ query time. The authors also develop near-linear-space structures that return additive or multiplicative approximations of both entropies, including specialized 1D and higher-dimensional constructions. They show how these structures support partitioning and histogram construction, and discuss connections to range-colored queries and to existing streaming and dual-access entropy results. The resulting toolkit targets entropy-based analysis over geometric query ranges, with applications to compression, histograms, and data cleaning.

Abstract

Data partitioning that maximizes/minimizes the Shannon entropy, or more generally the Rényi entropy, is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partition algorithms can be accelerated if we have a data structure to compute the entropy in different subsets of data when the algorithm needs to decide what block to construct. Such a data structure will also be useful for data analysts exploring different subsets of data to identify areas of interest. While it is generally known how to compute the Shannon or the Rényi entropy of a discrete distribution in the offline or streaming setting efficiently, we focus on the query setting, where we aim to efficiently derive the entropy of a subset of data that satisfies some linear predicates. We solve this problem in a setting typical of real data, where data items are geometric points and each requested area is a query (hyper)rectangle. More specifically, we consider a set $P$ of $n$ weighted and colored points in $\mathbb{R}^d$, where $d$ is a constant. For the range S-entropy (resp. R-entropy) query problem, the goal is to construct a low-space data structure such that, given a query (hyper)rectangle $R$, it computes the Shannon (resp. Rényi) entropy based on the colors and the weights of the points in $P\cap R$, in sublinear time. We show conditional lower bounds proving that we cannot hope for data structures with near-linear space and near-constant query time for both the range S-entropy and R-entropy query problems. Then, we propose exact data structures for $d=1$ and $d>1$ with $o(n^{2d})$ space and $o(n)$ query time for both problems. Finally, we propose near-linear-space data structures for returning either an additive or a multiplicative approximation of the Shannon (resp. Rényi) entropy in $P\cap R$.
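To make the query problem concrete, the following is a minimal sketch of the naive linear-scan baseline that the paper's data structures are meant to improve on. It is not the authors' method. The names `range_entropy`, `points`, and `rect` are hypothetical, and the sketch assumes that the probability of a color is its total weight in $P\cap R$ divided by the total weight of $P\cap R$, that logarithms are base 2, and that an empty range has entropy 0.

```python
import math
from collections import defaultdict


def range_entropy(points, rect, alpha=None):
    """Brute-force entropy of the color distribution inside a rectangle.

    points: iterable of (coords, weight, color) triples (hypothetical layout)
    rect:   list of closed intervals (lo, hi), one per dimension
    alpha:  None for Shannon entropy; otherwise the Rényi order
            (alpha > 0 and alpha != 1)
    """
    if alpha is not None and (alpha <= 0 or alpha == 1):
        raise ValueError("Rényi order must be positive and different from 1")

    # Accumulate the weight of each color among the points inside the range.
    weight_by_color = defaultdict(float)
    for coords, weight, color in points:
        if all(lo <= x <= hi for x, (lo, hi) in zip(coords, rect)):
            weight_by_color[color] += weight

    total = sum(weight_by_color.values())
    if total == 0:
        return 0.0  # assumed convention for an empty (or zero-weight) range

    probs = [w / total for w in weight_by_color.values() if w > 0]
    if alpha is None:
        return sum(p * math.log2(1 / p) for p in probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1 - alpha)


# Usage: three unit-weight points in the plane, two of them inside the range.
points = [((0.0, 0.0), 1.0, "red"), ((1.0, 1.0), 1.0, "blue"), ((5.0, 5.0), 1.0, "green")]
rect = [(0.0, 2.0), (0.0, 2.0)]
print(range_entropy(points, rect))           # Shannon entropy
print(range_entropy(points, rect, alpha=2))  # Rényi entropy of order 2
```

Each call scans all $n$ points, so the query time is linear; the exact and approximate structures described in the abstract aim for sublinear query time instead.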

Paper Structure

This paper contains 65 sections, 33 theorems, 51 equations, 5 figures, 2 tables.

Key Result

Lemma 3.1

In the preceding reduction, $c_{i,j} = 0$ if and only if $H_{i,j} = H'_{i,j}$.

Figures (5)

  • Figure 1: A set $P$ of $20$ points in $\mathbb{R}^2$. For simplicity, assume that the weight of every point is $1$, i.e., $w(p)=1$ for every $p\in P$. Each point is associated with one color (or category): red, green, blue, or purple. There are three different colors among the points in $P\cap R$, namely red, green, and blue. The distribution $\mathcal{D}_R$ is defined over $3$ outcomes: red, green, and blue. The probability of red is $\mathcal{D}_R(\mathsf{red})=\frac{2}{9}$ because there are $2$ red points and $9$ total points in $P\cap R$. Similarly, the probability of green is $\mathcal{D}_R(\mathsf{green})=\frac{3}{9}$ and the probability of blue is $\mathcal{D}_R(\mathsf{blue})=\frac{4}{9}$. We have $H(P\cap R)=H(\mathcal{D}_R)=\frac{2}{9}\log\frac{9}{2}+\frac{3}{9}\log\frac{9}{3}+\frac{4}{9}\log\frac{9}{4}\approx 1.53$ and $H_2(P\cap R)=H_2(\mathcal{D}_R)=-\log\left((2/9)^2+(3/9)^2+(4/9)^2\right)\approx 1.48$.
  • Figure 2: An example of constructing the point set $P$ on the right based on two sample $3 \times 3$ matrices $A$ and $B$ on the left. The colors red, green, and blue represent colors $1$, $2$, and $3$, respectively. Points $1$ to $9$ are the points in $A_1A_2A_3$ corresponding to the rows of $A$, and points $10$ to $18$ are the points in $B_1B_2B_3$ corresponding to the columns of $B$. Blocks are separated by vertical dashed lines. The interval $\rho_{2, 2}$, which is used to find the entry $c_{2,2}$, contains the points $6$ to $14$ as shown.
  • Figure 3: Lower bound construction.
  • Figure 4: Instance of the query algorithm given query interval $R$. Purple points are points in $P_R$.
  • Figure 5: Partition $P$ into $K$ buckets in $\mathbb{R}^2$. Two consecutive buckets have at most one color in common.
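
The arithmetic in the Figure 1 caption can be checked with a short sketch. The function names below are hypothetical. The base-2 logarithm is an inference from the caption's numbers (they match base 2), and the general order-$\alpha$ formula is the standard Rényi definition, of which the caption shows only the $\alpha=2$ case.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (base 2) of the distribution given by color counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def renyi_entropy(counts, alpha):
    """Rényi entropy (base 2) of order alpha, with alpha > 0 and alpha != 1."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("Rényi order must be positive and different from 1")
    total = sum(counts)
    power_sum = sum((c / total) ** alpha for c in counts if c > 0)
    return math.log2(power_sum) / (1 - alpha)


# Color counts inside R in Figure 1: 2 red, 3 green, 4 blue (unit weights).
figure1_counts = [2, 3, 4]
h = shannon_entropy(figure1_counts)
h2 = renyi_entropy(figure1_counts, 2)
print(round(h, 2), round(h2, 2))  # prints: 1.53 1.48
```

For $\alpha=2$ the general formula reduces to $-\log\sum_i p_i^2$, which is the expression used in the caption. As $\alpha$ approaches $1$, the Rényi entropy converges to the Shannon entropy, which is why the two problems are treated side by side.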

Theorems & Definitions (57)

  • Example 1.1: Columnar Compression
  • Example 1.2: Histogram Construction
  • Example 1.3: Data Cleaning
  • Example 1.4: Diversity index
  • Example 1.5: Network-Traffic Anomaly Detection
  • Definition 1.6: Range S-entropy query problem
  • Definition 1.7: Range R-entropy query problem
  • Lemma 3.1
  • Lemma 3.1
  • proof
  • ...and 47 more