Online String Attractors
Philip Whittington
TL;DR
This work studies online variants of string attractors and $k$-attractors in the streaming setting, connecting online attractors to Lempel–Ziv factorization and substring complexity. It proves that the online string attractor problem is $\mathcal{O}(\log n)$-competitive and that Lempel–Ziv effectively provides the optimal online strategy, with matching $\Omega(\log n)$ lower bounds demonstrated on Fibonacci and Thue–Morse sequences. For online $k$-attractors, the Lazy greedy approach is shown to be strictly $k$-competitive, with tight upper and lower bounds established through spoon-feeding constructions and de Bruijn sequences. The results illuminate the interplay between streaming compression, attractor sizes, and substring complexity, offering theoretical underpinnings for streaming dictionary compression and guiding future work on worst-case inputs relative to the optimal attractor size $\gamma^*$.
Abstract
In today's data-centric world, fast and effective compression of data is paramount. To measure success towards the second goal, Kempa and Prezza [STOC 2018] introduced the string attractor, a combinatorial object unifying dictionary-based compression. Given a string $T \in \Sigma^n$, a string attractor ($k$-attractor) is a set of positions $\Gamma \subseteq [1,n]$ such that every distinct substring (of length at most $k$) has at least one occurrence that contains one of the selected positions. String attractors are shown to be approximated by, and thus to measure the quality of, many important dictionary compression algorithms, such as Lempel-Ziv 77, the Burrows-Wheeler transform, straight-line programs, and macro schemes. In order to handle massive amounts of data, compression often has to be achieved in a streaming fashion. Thus, practically applied compression algorithms, such as Lempel-Ziv 77, have been extensively studied in an online setting. To the best of our knowledge, no such work exists for the string attractor problem, which therefore lacks theoretical underpinnings in this setting. We introduce a natural online variant of both the $k$-attractor and the string attractor problem. First, we show that the Lempel-Ziv factorization corresponds to the best online algorithm for this problem, resulting in an upper bound of $\mathcal{O}(\log(n))$ on the competitive ratio. On the other hand, there are families of sparse strings which have constant-size optimal attractors, e.g., prefixes of the infinite Sturmian words and Thue-Morse words, which are created by iterative application of a morphism. We consider the most famous of these Sturmian words, the Fibonacci word, and show that any online algorithm has a cost growing with the length of the word, for a matching lower bound of $\Omega(\log(n))$. For the online $k$-attractor problem, we show tight (strict) $k$-competitiveness.
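To make the definition concrete, the following is a minimal brute-force sketch in Python, not taken from the paper. The function names `is_k_attractor` and `lz_phrase_ends` are hypothetical. The checker tests the ($k$-)attractor property directly from the definition, using 1-indexed positions. The second function computes phrase end positions for one common Lempel-Ziv 77 variant (each phrase is the longest prefix of the remaining text that also occurs starting at an earlier position, or a single fresh character); this variant is an assumption, and the paper's exact online model may differ. The end positions of such phrases are known to form a string attractor, which the checker can confirm on small inputs.

```python
def is_k_attractor(text, positions, k=None):
    """Brute-force check of the (k-)attractor property.

    positions are 1-indexed, as in the definition above.
    With k=None, all substring lengths are checked (string attractor).
    """
    n = len(text)
    k = n if k is None else min(k, n)
    gamma = set(positions)
    distinct = set()   # every distinct substring of length <= k
    covered = set()    # those with an occurrence containing a chosen position
    for length in range(1, k + 1):
        for start in range(n - length + 1):
            sub = text[start:start + length]
            distinct.add(sub)
            # This occurrence spans 1-indexed positions start+1 .. start+length.
            if any(p in gamma for p in range(start + 1, start + length + 1)):
                covered.add(sub)
    return covered == distinct


def lz_phrase_ends(text):
    """1-indexed end positions of phrases in an assumed LZ77 variant.

    Each phrase is the longest prefix of text[i:] that also occurs starting
    at some position j < i (overlaps allowed), or one fresh character.
    """
    ends = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        # str.find(sub, 0, end) only reports occurrences lying inside
        # text[0:end]; end = i + length forces the start to be < i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        step = max(length, 1)  # a fresh character forms a phrase of length 1
        i += step
        ends.append(i)         # i is now the 1-indexed end of the phrase
    return ends
```

For example, `lz_phrase_ends("abab")` yields the phrases `a`, `b`, `ab`, and `is_k_attractor("abab", lz_phrase_ends("abab"))` confirms that their end positions cover every distinct substring. The checker takes cubic time or worse and is meant only for experimenting with short strings.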
