Online String Attractors

Philip Whittington

TL;DR

This work studies online variants of string attractors and $k$-attractors in the streaming setting, connecting online attractors to Lempel–Ziv factorization and substring complexity. It proves that the online string attractor problem is $\mathcal{O}(\log n)$-competitive and that Lempel–Ziv effectively provides the optimal online strategy, with matching $\Omega(\log n)$ lower bounds demonstrated on Fibonacci and Thue–Morse sequences. For online $k$-attractors, the Lazy greedy approach is shown to be strictly $k$-competitive, with tight upper and lower bounds established through spoon-feeding constructions and de Bruijn sequences. The results illuminate the interplay between streaming compression, attractor sizes, and substring complexity, offering theoretical underpinnings for streaming dictionary compression and guiding future work on worst-case inputs relative to the optimal attractor size $\gamma^*$.

Abstract

In today's data-centric world, fast and effective compression of data is paramount. To measure success towards the second goal, effectiveness, Kempa and Prezza [STOC 2018] introduced the string attractor, a combinatorial object unifying dictionary-based compression. Given a string $T \in \Sigma^n$, a string attractor ($k$-attractor) is a set of positions $\Gamma \subseteq [1,n]$ such that every distinct substring (of length at most $k$) has at least one occurrence that contains one of the selected positions. String attractors are shown to be approximated by, and thus to measure the quality of, many important dictionary compression algorithms such as Lempel-Ziv 77, the Burrows-Wheeler transform, straight line programs, and macro schemes. In order to handle massive amounts of data, compression often has to be achieved in a streaming fashion. Thus, practically applied compression algorithms, such as Lempel-Ziv 77, have been extensively studied in an online setting. To the best of our knowledge, there has been no such work, and therefore there are no theoretical underpinnings, for the string attractor problem. We introduce a natural online variant of both the $k$-attractor and the string attractor problem. First, we show that the Lempel-Ziv factorization corresponds to the best online algorithm for this problem, resulting in an upper bound of $\mathcal{O}(\log n)$ on the competitive ratio. On the other hand, there are families of sparse strings which have constant-size optimal attractors, e.g., prefixes of the infinite Sturmian words and Thue-Morse words, which are created by iterative application of a morphism. We consider the most famous of these Sturmian words, the Fibonacci word, and show that any online algorithm has a cost growing with the length of the word, for a matching lower bound of $\Omega(\log n)$. For the online $k$-attractor problem, we show tight (strict) $k$-competitiveness.
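The attractor definition and its link to Lempel-Ziv can be made concrete with a small brute-force sketch. Everything below is illustrative and not taken from the paper: the function names are hypothetical, positions are 0-indexed (the abstract uses $[1,n]$), the factorization is one simple greedy LZ variant chosen here (longest previous occurrence with overlap allowed, otherwise a single new character), and the Fibonacci word indexing ($F_0 = b$, $F_1 = a$) is an assumed convention. The sketch checks the known fact that the last positions of the LZ phrases form a string attractor.

```python
from bisect import bisect_left


def is_k_attractor(text, positions, k=None):
    """Return True if `positions` (0-indexed) is a k-attractor of `text`.

    With k=None, substrings of every length are checked (string attractor).
    Brute force: only suitable for short strings.
    """
    n = len(text)
    max_len = n if k is None else min(k, n)
    pos = sorted(set(positions))
    for length in range(1, max_len + 1):
        seen, covered = set(), set()
        for start in range(n - length + 1):
            sub = text[start:start + length]
            seen.add(sub)
            # Is there a selected position inside [start, start + length)?
            i = bisect_left(pos, start)
            if i < len(pos) and pos[i] < start + length:
                covered.add(sub)
        if seen != covered:  # some distinct substring has no covered occurrence
            return False
    return True


def lz_factorize(text):
    """Greedy LZ factorization (assumed variant), as (start, length) pairs.

    Each phrase is the longest prefix of the remaining text that also occurs
    starting at an earlier position (overlap allowed), or one new character.
    """
    phrases, i, n = [], 0, len(text)
    while i < n:
        length = 0
        # An occurrence found in text[0 : i + length] starts strictly before i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        length = max(length, 1)  # a character never seen before
        phrases.append((i, length))
        i += length
    return phrases


def lz_phrase_ends(text):
    """Last position of every LZ phrase (0-indexed)."""
    return [start + length - 1 for start, length in lz_factorize(text)]


def fibonacci_word(m):
    """F_0 = 'b', F_1 = 'a', F_m = F_{m-1} + F_{m-2} (assumed indexing)."""
    prev, cur = "b", "a"
    if m == 0:
        return prev
    for _ in range(m - 1):
        prev, cur = cur, cur + prev
    return cur


if __name__ == "__main__":
    for m in range(2, 13):
        word = fibonacci_word(m)
        ends = lz_phrase_ends(word)
        print(m, len(word), len(ends), is_k_attractor(word, ends))
```

Running the loop prints, for each Fibonacci word, its length, the number of LZ phrases, and whether the phrase ends pass the attractor check. The phrase count grows far more slowly than the word length, which is the kind of gap between online cost and constant-size optimal attractors that the abstract describes.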

Paper Structure (12 sections, 22 theorems, 16 equations, 2 figures, 3 tables)

Introduction
Our Contributions
Lempel-Ziv Compression
Online Attractors
Limiting the Alphabet
Fibonacci Words
Thue-Morse Words
Conclusions for the Unrestricted Case
Limiting the Scope
Upper Bound
Lower Bound
Conclusion

Key Result

Theorem 4

There is no deterministic algorithm for the online $k$-attractor problem that has a better competitive ratio than the Lazy algorithm. Further, for each deterministic algorithm $A$ that is not Lazy (or equivalent to it), there are instances on which $A$ has higher costs than Lazy.
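For context, the textbook notion of competitiveness from online algorithms can be written as follows. This is the standard definition rather than a quotation from the paper, and the symbols $\mathrm{cost}_A$, $\gamma_k^*$, $c$ and $\alpha$ are notation assumed here. An online algorithm $A$ is $c$-competitive if there is a constant $\alpha \ge 0$ such that

$$\mathrm{cost}_A(T) \;\le\; c \cdot \gamma_k^*(T) + \alpha \quad \text{for every input string } T,$$

where $\gamma_k^*(T)$ denotes the size of a smallest $k$-attractor of $T$. The algorithm is strictly $c$-competitive if the inequality holds with $\alpha = 0$. Under this reading, strict $k$-competitiveness of Lazy would mean $\mathrm{cost}_{\text{Lazy}}(T) \le k \cdot \gamma_k^*(T)$ for all $T$.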

Figures (2)

Figure 1: $w = F_m[f_{m-1}, f_m - 2]$ appears in $F_{m-1}$.
Figure 2: $w = F_m[f_{m-1} - 1, f_m - 2]$ appears in $F_{m-1}$.

Theorems & Definitions (31)

Definition 1: $k$-attractor
Definition 2: Lempel-Ziv Factorization
Definition 3: The Online $k$-Attractor Problem
Theorem 4
Theorem 5
Definition 6: Kernel word
Lemma 7
Lemma 8
Lemma 9
Theorem 10: Online cost of Fibonacci words
...and 21 more
