On the compressiveness of the Burrows-Wheeler transform

Hideo Bannai, Tomohiro I, Yuto Nakashima

TL;DR

It is shown that BWT and BBWT do not increase the repetitiveness of the string with respect to various measures based on dictionary compression by more than a polylogarithmic factor, and there exists an infinite family of strings that are maximally incompressible by any dictionary compression measure, but become very compressible after applying BBWT.

Abstract

The Burrows-Wheeler transform (BWT) is a reversible transform that converts a string $w$ into another string $\mathsf{BWT}(w)$. The size of the run-length encoded BWT (RLBWT) can be interpreted as a measure of repetitiveness in the class of representations called dictionary compression, which are essentially representations based on copy-and-paste operations. In this paper, we shed new light on the compressiveness of BWT and the bijective BWT (BBWT). We first extend previous results on the relations of their run-length compressed sizes $r$ and $r_B$. We also show that the so-called "clustering effect" of BWT and BBWT can be captured by measures other than empirical entropy or run-length encoding. In particular, we show that BWT and BBWT do not increase the repetitiveness of the string with respect to various measures based on dictionary compression by more than a polylogarithmic factor. Furthermore, we show that there exists an infinite family of strings that are maximally incompressible by any dictionary compression measure, but become very compressible after applying BBWT. An interesting implication of this result is that it is possible to transcend dictionary compression in some cases by simply applying BBWT before applying dictionary compression.
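
To make the quantities $r$ and $r_B$ concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the textbook definitions: $\mathsf{BWT}(w)$ is the last column of the sorted rotations of $w$ (no end marker is appended here), and the BBWT is the last column of the rotations of the Lyndon factors of $w$, sorted by comparing their infinite repetitions. The function names `bwt`, `bbwt`, `lyndon_factors`, and `runs` are illustrative, and the quadratic-time sorting is chosen for clarity rather than efficiency.

```python
def bwt(w):
    """BWT as the last column of the sorted rotations of w (no end marker)."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rot[-1] for rot in rotations)


def lyndon_factors(w):
    """Lyndon factorization of w using Duval's algorithm."""
    factors = []
    i, n = 0, len(w)
    while i < n:
        j, k = i + 1, i
        while j < n and w[k] <= w[j]:
            k = i if w[k] < w[j] else k + 1
            j += 1
        # Emit every full copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(w[i:i + j - k])
            i += j - k
    return factors


def bbwt(w):
    """Bijective BWT: sort the rotations of all Lyndon factors of w by
    their infinite repetitions and read off the last characters."""
    n = len(w)
    rotations = [f[i:] + f[:i] for f in lyndon_factors(w) for i in range(len(f))]
    # Two infinite repetitions u^omega and v^omega that agree on their first
    # |u| + |v| symbols are equal, so a prefix of length 2n is a safe sort key.
    rotations.sort(key=lambda r: (r * (2 * n // len(r) + 1))[:2 * n])
    return "".join(rot[-1] for rot in rotations)


def runs(s):
    """Number of maximal runs of equal symbols in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])
```

For the string `banana`, this sketch gives `bwt("banana") == "nnbaaa"` with `runs` equal to 3, and `bbwt("banana") == "annbaa"` with `runs` equal to 4.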

Paper Structure

This paper contains 14 sections, 19 theorems, 5 equations, 2 figures.

Key Result

Theorem 2

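There exists an infinite family of strings of length $n$ such that $r = \Omega(r_B\log n)$, where $r$ and $r_B$ are the run-length compressed sizes of the BWT and the BBWT, respectively.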

Figures (2)

  • Figure 1: Illustration of the proof of Lemma \ref{lemma:rLogarithmic}. Every position in the set $\mathsf{pos}$ is preceded by an occurrence of $\mathtt{ab}$. We can obtain the lexicographic rank of the rotation $\mathsf{rot}^{i-2}(y_{k'+1})$ that starts with $\mathtt{ab}$ by using the corresponding rotation $\mathsf{rot}^{s(i)}(y_{k'})$. The preceding symbols (underlined) $y_{k'+1}[i-3]$ and $y_{k'}[\mathit{s}(i)-1]$ are always the same. The figure shows the case $y_{k'+1}[i-3] = y_{k'}[\mathit{s}(i)-1] = \mathtt{a}$ at position $i$ and the case $y_{k'+1}[j-3] = y_{k'}[\mathit{s}(j)-1] = \mathtt{b}$ at position $j$.
  • Figure 2: Example of Lemma \ref{lemma:PsiOfFibIsRotation} for $F_7$. The $\mathtt{a}$ at position $18$ is the $12$th $\mathtt{a}$ in $F_7[0..18]$, and thus the LF mapping should point to position $11$, whose Zeckendorf representation $Z_7(11)$ is a 1-bit left rotation of $Z_7(18)$. The $\mathtt{b}$ at position $17$ is the $7$th $\mathtt{b}$ in $F_7[0..17]$, and since there are $13$ $\mathtt{a}$'s in $F_7$, the LF mapping should point to position $13+7-1=19$, whose Zeckendorf representation $Z_7(19)$ is a 2-bit left rotation of $Z_7(17)$.
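
The LF-mapping arithmetic in the caption of Figure 2 can be checked with a short Python sketch. It assumes the Fibonacci words are defined by $F_0 = \mathtt{b}$, $F_1 = \mathtt{a}$, $F_k = F_{k-1}F_{k-2}$, which is consistent with the caption's count of $13$ $\mathtt{a}$'s in $F_7$, and it applies the standard LF mapping directly to $F_7$ with 0-based positions. The names `fib_word` and `lf` are illustrative, and the Zeckendorf representations $Z_7$ are left out because their exact definition is not given here.

```python
def fib_word(k):
    """k-th Fibonacci word, assuming F_0 = 'b', F_1 = 'a', F_k = F_{k-1} F_{k-2}."""
    prev, cur = "b", "a"
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, cur = cur, cur + prev
    return cur


def lf(L, i):
    """Standard LF mapping on a string L (0-based positions):
    (# symbols in L smaller than L[i]) + (rank of L[i] within L[0..i]) - 1."""
    c = L[i]
    smaller = sum(1 for x in L if x < c)
    rank = L[:i + 1].count(c)
    return smaller + rank - 1


F7 = fib_word(7)
```

Under these assumptions, `F7` is `abaababaabaababaababa`, `lf(F7, 18)` evaluates to 11, and `lf(F7, 17)` evaluates to 19, matching the two positions named in the caption.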

Theorems & Definitions (20)

  • Theorem 2
  • Lemma 3
  • Theorem 4: Theorem 1 in christodoulakis_sofsem2006
  • Corollary 5
  • Corollary 6
  • Claim 7: Claim 5 in mieno_cpm2022
  • Lemma 8
  • Lemma 9
  • Lemma 10
  • Lemma 11
  • ...and 10 more