Analyzing and Leveraging the $k$-Sensitivity of LZ77

Gabriel Bathie, Paul Huber, Guillaume Lagarde, Akka Zemmari

TL;DR

An $\varepsilon$-approximation algorithm to pre-edit a word $w$ with a budget of $k$ modifications to improve its compression, together with a surprising trichotomy based on the compressibility of $w$.

Abstract

We study the sensitivity of the Lempel-Ziv 77 compression algorithm to edits, showing how modifying a string $w$ can deteriorate or improve its compression. Our first result is a tight upper bound for $k$ edits: $\forall w' \in B(w,k)$, we have $C_{\mathrm{LZ77}}(w') \leq 3 \cdot C_{\mathrm{LZ77}}(w) + 4k$. This result contrasts with Lempel-Ziv 78, where a single edit can significantly deteriorate compressibility, a phenomenon known as a *one-bit catastrophe*. We further refine this bound, focusing on the coefficient $3$ in front of $C_{\mathrm{LZ77}}(w)$, and establish a surprising trichotomy based on the compressibility of $w$. More precisely, we prove the following bounds:

- if $C_{\mathrm{LZ77}}(w) \lesssim k^{3/2}\sqrt{n}$, the compression may increase by up to a factor of $\approx 3$,
- if $k^{3/2}\sqrt{n} \lesssim C_{\mathrm{LZ77}}(w) \lesssim k^{1/3}n^{2/3}$, this factor is at most $\approx 2$,
- if $C_{\mathrm{LZ77}}(w) \gtrsim k^{1/3}n^{2/3}$, the factor is at most $\approx 1$.

Finally, we present an $\varepsilon$-approximation algorithm to pre-edit a word $w$ with a budget of $k$ modifications to improve its compression. In favorable scenarios, this approach yields a total compressed size reduction by up to a factor of $3$, accounting for both the LZ77 compression of the modified word and the cost of storing the edits, $C_{\mathrm{LZ77}}(w') + k \log |w|$.
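To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes that $C_{\mathrm{LZ77}}(w)$ counts the phrases of a greedy, non-self-referential LZ77 parse (each phrase is the longest factor already seen in the parsed prefix, followed by one new letter), that edits are restricted to single-letter substitutions, and that the logarithm in the storage cost is base 2. The function names `lz77_phrase_count`, `single_substitutions`, `sensitivity_k1`, and `total_cost` are hypothetical.

```python
from math import log2


def lz77_phrase_count(w: str) -> int:
    """Number of phrases in a greedy, non-self-referential LZ77 parse (assumed definition)."""
    count, i, n = 0, 0, len(w)
    while i < n:
        length = 0
        # Extend the match while w[i:i+length+1] occurs entirely inside w[0:i].
        while i + length < n and w.find(w[i:i + length + 1], 0, i) != -1:
            length += 1
        # The phrase is the match plus one new letter (or just the match at the end of w).
        i = min(i + length + 1, n)
        count += 1
    return count


def single_substitutions(w: str, alphabet: str):
    """Yield every word obtained from w by substituting exactly one letter."""
    for pos in range(len(w)):
        for c in alphabet:
            if c != w[pos]:
                yield w[:pos] + c + w[pos + 1:]


def sensitivity_k1(w: str, alphabet: str):
    """Return (C(w), best C(w'), worst C(w')) over all single substitutions w'."""
    base = lz77_phrase_count(w)
    counts = [lz77_phrase_count(v) for v in single_substitutions(w, alphabet)]
    return base, min(counts), max(counts)


def total_cost(w_edited: str, k: int, n: int) -> float:
    """C(w') + k * log2(n): compressed size plus the cost of storing k edits."""
    return lz77_phrase_count(w_edited) + (k * log2(n) if k else 0.0)


if __name__ == "__main__":
    w = "abbbababababbaa"
    base, best, worst = sensitivity_k1(w, "ab")
    print(base, best, worst)
    # With k = 1, the abstract's bound reads worst <= 3 * base + 4.
    print(worst <= 3 * base + 4)
```

Brute force over all substitutions is exponential in $k$ and is only meant for experimenting with short words; the paper's $\varepsilon$-approximation algorithm is not reproduced here.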

Paper Structure

This paper contains 43 sections, 21 theorems, 127 equations, 3 figures.

Key Result

Theorem 1

For all $k \in \mathbb{N}$, $w \in \Sigma^*$, and $w' \in B(w,k)$, we have $C_{\mathrm{LZ77}}(w') \leq 3 \cdot C_{\mathrm{LZ77}}(w) + 4k$.

Figures (3)

  • Figure 1: Representation of the trichotomy when $n\rightarrow\infty$. $r_{max}$ represents the maximum $r=\frac{C_{\mathrm{LZ77}}(w')}{C_{\mathrm{LZ77}}(w)}$ a word of size $n$ can attain with the given compression size.
  • Figure 2: An example of LZ77 and LZ77sr parsing on the word "abbbababababbaa". The compressed form of each block is shown in parentheses: in order, the starting position of the referenced substring, its length, and the letter added.
  • Figure 3: Positions and relations of values used in Lemma \ref{lem:jump-gap}
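
The block encoding described in the caption of Figure 2 can be sketched in Python as follows. This is an illustration under assumptions, not the paper's exact convention: positions are 0-indexed, the leftmost earlier occurrence is chosen, a block with no reference uses position 0 and length 0, and only the non-self-referential variant is covered (LZ77sr is not implemented). The names `lz77_triples` and `decode_triples` are hypothetical.

```python
def lz77_triples(w: str):
    """Greedy non-self-referential LZ77 parse as (start, length, letter) triples."""
    triples = []
    i, n = 0, len(w)
    while i < n:
        start, length = 0, 0
        while i + length < n:
            # Look for w[i:i+length+1] entirely inside the already parsed prefix w[0:i].
            pos = w.find(w[i:i + length + 1], 0, i)
            if pos == -1:
                break
            start, length = pos, length + 1
        # The added letter is empty only when the match reaches the end of w.
        letter = w[i + length] if i + length < n else ""
        triples.append((start, length, letter))
        i += length + 1
    return triples


def decode_triples(triples) -> str:
    """Rebuild the word by copying each referenced substring and appending the letter."""
    out = ""
    for start, length, letter in triples:
        out += out[start:start + length] + letter
    return out


if __name__ == "__main__":
    word = "abbbababababbaa"
    blocks = lz77_triples(word)
    print(blocks)
    print(decode_triples(blocks) == word)
```

Under these assumptions the word from Figure 2 splits into the six blocks a, b, bb, aba, babab, baa; the triples printed in the paper's figure may differ in indexing or in the choice of referenced occurrence.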

Theorems & Definitions (48)

  • Theorem 1
  • Remark 1
  • Proposition 1
  • Theorem 2
  • Corollary 3
  • Remark 2
  • Proposition 3
  • Remark 3
  • Theorem 4
  • Theorem 5
  • ...and 38 more