The landscape of compressibility measures for two-dimensional data

Lorenzo Carfagna; Giovanni Manzini

TL;DR

This work generalizes string-attractor-based compressibility measures to two dimensions. For matrices it defines $\gamma_{2D}$ (the size of a smallest 2D attractor), $\delta_{2D}$ (based on the number of distinct $k\times k$ submatrices), and $b_{2D}$ (a 2D bidirectional macro scheme measure). It establishes core theoretical properties: $\delta_{2D}$ is computable in time linear in the input size ($O(n^2)$ for an $n\times n$ matrix), computing $\gamma_{2D}$ is NP-complete, and $\delta_{2D} \leq \gamma_{2D}$ with a gap that can be as large as $\Omega(\sqrt{n})$. It also bounds the space usage of the two-dimensional block tree (2D-BT) in terms of both $\delta_{2D}$ and $\gamma_{2D}$, and provides a linear-time, linear-space algorithm to construct the 2D-BT for arbitrary matrices. An attractor-based construction of macro schemes clarifies the relationships among $\gamma_{2D}$, $\delta_{2D}$, and $b_{2D}$, with implications for compressible 2D data and related indices. The paper also discusses extensions to 3D, potential Lempel–Ziv analogues in 2D, and open questions regarding tighter bounds and practical implementations.

Abstract

In this paper we extend to two-dimensional data two recently introduced one-dimensional compressibility measures: the $\gamma$ measure defined in terms of the smallest string attractor, and the $\delta$ measure defined in terms of the number of distinct substrings of the input string. Concretely, we introduce the two-dimensional measures $\gamma_{2D}$ and $\delta_{2D}$, as natural generalizations of $\gamma$ and $\delta$, and we initiate the study of their properties. Among other things, we prove that $\delta_{2D}$ is monotone and can be computed in linear time, and we show that, although it is still true that $\delta_{2D} \leq \gamma_{2D}$, the gap between the two measures can be $\Omega(\sqrt{n})$ and therefore asymptotically larger than the gap between $\gamma$ and $\delta$. To complete the scenario of two-dimensional compressibility measures, we introduce the measure $b_{2D}$ which generalizes to two dimensions the notion of optimal parsing. We prove that, somewhat surprisingly, the relationship between $b_{2D}$ and $\gamma_{2D}$ is significantly different from the one-dimensional case. As an application of our results we provide the first analysis of the space usage of the two-dimensional block tree introduced in [Brisaboa et al., Two-dimensional block trees, The Computer Journal, 2024]. Our analysis shows that the space usage can be bounded in terms of both $\gamma_{2D}$ and $\delta_{2D}$. Finally, using insights from our analysis, we design the first linear-time, linear-space algorithm for constructing the two-dimensional block tree for arbitrary matrices.
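To make the $\delta_{2D}$ measure concrete, here is a minimal brute-force sketch in Python. It is not the linear-time algorithm from the paper. It assumes the definition $\delta_{2D}(M) = \max_k d_{k\times k}(M)/k^2$, where $d_{k\times k}(M)$ is the number of distinct $k\times k$ submatrices of $M$, by analogy with the one-dimensional $\delta$; this page does not spell out the formula, so treat it as an assumption. The function names are hypothetical.

```python
from fractions import Fraction


def distinct_square_submatrices(matrix, k):
    """Count distinct k x k submatrices of a rectangular matrix.

    `matrix` is a non-empty list of equal-length rows (lists or strings).
    """
    rows, cols = len(matrix), len(matrix[0])
    seen = set()
    for i in range(rows - k + 1):
        for j in range(cols - k + 1):
            # Freeze the k x k window into a hashable tuple of tuples.
            seen.add(tuple(tuple(matrix[i + r][j:j + k]) for r in range(k)))
    return len(seen)


def delta_2d(matrix):
    """Brute-force delta_2D under the ASSUMED definition max_k d_{k x k} / k^2.

    Exhaustive enumeration, intended only for small illustrative inputs.
    """
    rows, cols = len(matrix), len(matrix[0])
    return max(
        Fraction(distinct_square_submatrices(matrix, k), k * k)
        for k in range(1, min(rows, cols) + 1)
    )
```

For example, under this definition a matrix filled with a single symbol has `delta_2d` equal to 1, because every window size contributes exactly one distinct submatrix. Exact fractions are used so that the maximum is not affected by floating-point rounding.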

Paper Structure (11 sections, 22 theorems, 13 equations, 5 figures)

This paper contains 11 sections, 22 theorems, 13 equations, 5 figures.

Introduction
Notation and background
Attractors for two-dimensional structures
The measure $\delta_{2D}$
A glimpse on 3D measures
Two-dimensional bidirectional macro schemes
Space Bounds for Two-Dimensional Block Trees
Two-Dimensional Block Trees construction
Concluding Remarks
Backward pointers computation
Construction of the 2D block tree for arbitrary matrix size

Key Result

lemma 1

Given a string $S \in \Sigma^n$, let $R^S \in \Sigma^{n\times n}$ be the square matrix where each row is equal to the string $S$. Then there exists a (1-dim) attractor for $S$ of size $k$ if and only if there exists a (2-dim) attractor of size $k$ for $R^S$.
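Lemma 1 can be sanity-checked on tiny inputs by exhaustive search. The Python sketch below assumes the usual definitions: a 1-dim attractor is a set of positions such that every substring has an occurrence crossing one of them, and a 2-dim attractor is a set of cells such that every rectangular submatrix has an occurrence containing one of them. The helper names are hypothetical, and the search is exponential in the input size, so it only illustrates the statement and is not a practical algorithm.

```python
from itertools import combinations


def is_attractor_1d(s, positions):
    """True if every substring of s has an occurrence crossing some position."""
    n = len(s)
    all_subs, covered = set(), set()
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            sub = s[i:i + length]
            all_subs.add(sub)
            if any(i <= p < i + length for p in positions):
                covered.add(sub)
    return all_subs == covered


def is_attractor_2d(matrix, cells):
    """True if every submatrix has an occurrence containing some cell (row, col)."""
    rows, cols = len(matrix), len(matrix[0])
    all_subs, covered = set(), set()
    for h in range(1, rows + 1):
        for w in range(1, cols + 1):
            for i in range(rows - h + 1):
                for j in range(cols - w + 1):
                    sub = tuple(tuple(matrix[i + r][j:j + w]) for r in range(h))
                    all_subs.add(sub)
                    if any(i <= x < i + h and j <= y < j + w for x, y in cells):
                        covered.add(sub)
    return all_subs == covered


def smallest_attractor_1d(s):
    """Size of a smallest 1-dim attractor of a non-empty string (brute force)."""
    for size in range(1, len(s) + 1):
        if any(is_attractor_1d(s, c) for c in combinations(range(len(s)), size)):
            return size


def smallest_attractor_2d(matrix):
    """Size of a smallest 2-dim attractor of a non-empty matrix (brute force)."""
    cells = [(x, y) for x in range(len(matrix)) for y in range(len(matrix[0]))]
    for size in range(1, len(cells) + 1):
        if any(is_attractor_2d(matrix, c) for c in combinations(cells, size)):
            return size


def row_matrix(s):
    """The matrix R^S of Lemma 1: len(s) rows, each equal to s."""
    return [s] * len(s)
```

On short strings such as `"abab"`, `smallest_attractor_1d(s)` and `smallest_attractor_2d(row_matrix(s))` return the same value, as the lemma predicts.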

Figures (5)

Figure 1: A square matrix $C$ on the left, and its Istring $I_C$ on the right (last two Icharacters are omitted)
Figure 2: The submatrix $A[2..5][1..4]=A_{21}$ with solid black border on the left and its Istring $I_{A_{21}}$ on the right. The Istring of the submatrix $A[2..3][1..2]$ (in red) is the third Iprefix of $I_{A_{21}}$.
Figure 3: The matrix used in the proof of Lemma \ref{lemma:gap_b_gd_1} (a), and a parsing of its upper left quadrant (b).
Figure 4: A block $X$ and its first occurrence $O$ in row-major order. If $X$ is not marked, its node $X_v$ in the 2D-BT will point to the four blocks overlapping $O$ and will store the offset $\langle x,y \rangle$ of $O$ with respect to the block including $O$'s top left corner. The pointed blocks are marked since $D$ includes $O$ and therefore is a block-marker. With reference to the proof of Lemma \ref{lemma:2DBTnodes}, $D$ is a type 3 block-marker: by considering every entry in the red block as an upper left corner we obtain $k^{2\ell}$ distinct $3k^\ell\times 3k^\ell$ submatrices containing $D$. The type 2 block-marker $D'$ borders the upper edge; by considering the $k^\ell$ entries in the first row marked in red we obtain $k^{\ell}$ distinct $3k^\ell\times 3k^\ell$ submatrices containing $D'$. We also show a type 2 block-marker $D''$ bordering the right edge; we obtain $k^{\ell}$ distinct $3k^\ell\times 3k^\ell$ submatrices containing $D''$ by considering the $3k^\ell\times 3k^\ell$ submatrices with the upper right corner in the portion of the last column marked in red.
Figure 5: Possible partitioning of a matrix whose size is not a power of $k$: along the right and bottom margins there are rectangular blocks of size $a_\ell \times b_\ell$ or $b_\ell \times a_\ell$. $R$ and $R'$ are rectangular superblocks; if they are a first occurrence, the four blocks they contain are marked. With reference to the proof of Lemma \ref{lemma:2DBTnodes}, the block-marker $D$ is adjacent to rectangular blocks, so we cannot guarantee that there are $k^{2\ell}$ distinct $3k^\ell\times 3k^\ell$ submatrices containing $D$. However, we obtain $k^{\ell}$ distinct $3k^\ell\times 3k^\ell$ submatrices containing $D$ by considering the $3k^\ell\times 3k^\ell$ submatrices with the upper right corner in the area marked in red.

Theorems & Definitions (51)

definition 1
definition 2
definition 3
definition 4
lemma 1
proof
theorem 1
lemma 2
proof
definition 5
...and 41 more
