Sensitivity of Repetitiveness Measures to String Reversal

Hideo Bannai, Yuto Fujie, Peaker Guo, Shunsuke Inenaga, Yuto Nakashima, Simon J. Puglisi, Cristian Urbina

TL;DR

The paper investigates how reversing a string affects a broad set of repetitiveness measures, including RLBWT variants, Lempel–Ziv parses and their variants, and the lexicographic parse. Using carefully constructed infinite string families, it proves $\Theta(n)$ additive lower bounds for the sensitivity of the RLBWT measures ($r$, $r_{\mathrm{dol}}$, $r_B$), and these bounds are asymptotically tight. For LZ parsing, it shows that the ratio $z(w^R)/z(w)$ can approach $3$ from below and that the additive change $z(w^R)-z(w)$ can be linear in $n$, with analogous additive results for $z_{\mathrm{no}}$ and $z_{\mathrm{end}}$. For the lex-parse, it shows a $\Theta(\log n)$ multiplicative sensitivity, along with a linear additive gap in a Fibonacci-based construction. Together, these results reveal substantial limitations of many practical repetitiveness measures under a simple data transformation and identify open questions about exact constants and tighter bounds.

Abstract

We study the impact that string reversal can have on several repetitiveness measures. First, we exhibit an infinite family of strings where the number, $r$, of runs in the run-length encoding of the Burrows–Wheeler transform (BWT) can increase additively by $\Theta(n)$ when reversing the string. This substantially improves the known $\Omega(\log n)$ lower bound for the additive sensitivity of $r$ and it is asymptotically tight. We generalize our result to other variants of the BWT, including the variant with an appended end-of-string symbol and the bijective BWT. We show that an analogous result holds for the size $z$ of the Lempel–Ziv 77 (LZ) parsing of the text, and also for some of its variants, including the non-overlapping LZ parsing and the LZ-end parsing. Moreover, we describe a family of strings for which the ratio $z(w^R)/z(w)$ approaches $3$ from below as $|w|\rightarrow \infty$. We also show an asymptotically tight lower bound of $\Theta(n)$ for the additive sensitivity of the size $v$ of the smallest lexicographic parsing to string reversal. Finally, we show that the multiplicative sensitivity of $v$ to reversing the string is $\Theta(\log n)$, and this lower bound is also tight. Overall, our results expose the limitations of repetitiveness measures that are widely used in practice against string reversal, a simple and natural data transformation.
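To make the measures in the abstract concrete, the following Python sketch computes $r$ and $z$ for a string and for its reversal. It is an illustration under stated assumptions, not the authors' code: the BWT is taken as the last column of the sorted rotations with no end marker, and the LZ parse is the greedy variant in which each phrase is either a fresh character or the longest prefix of the remaining text that also occurs starting at an earlier position (self-overlap allowed). The paper's exact definitions may differ in detail, and the function names `bwt`, `count_runs`, `lz_phrases`, and `reversal_report` are hypothetical. The quadratic-time approach is only meant for short inputs.

```python
def bwt(w: str) -> str:
    """Last column of the lexicographically sorted rotations of w (no end marker)."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


def lz_phrases(w: str) -> int:
    """Size of a greedy LZ parse of w (assumed variant, overlapping sources allowed)."""
    i, z, n = 0, 0, len(w)
    while i < n:
        length = 0
        # Extend while w[i:i+length+1] also occurs starting before position i.
        while i + length < n and w.find(w[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += max(length, 1)  # a fresh character forms a phrase on its own
        z += 1
    return z


def reversal_report(w: str) -> dict:
    """Compare r and z on w and on its reversal w^R."""
    rev = w[::-1]
    return {
        "r(w)": count_runs(bwt(w)),
        "r(w^R)": count_runs(bwt(rev)),
        "z(w)": lz_phrases(w),
        "z(w^R)": lz_phrases(rev),
    }
```

On a small input such as `"banana"` the report shows no change at all (the reversal `"ananab"` is a rotation of the original, so the BWT is identical). Short everyday strings therefore give little sense of the gaps the paper proves, which arise on its specially constructed infinite families.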

Paper Structure (15 sections, 24 theorems, 4 equations, 4 figures, 1 table)

Key Result

Lemma 8

$\BWT(u_k) = \asym^{2k}(\prod_{i=1}^k\bsym\lsym_i) \rsym_k (\prod_{i=1}^{k-1} \rsym_{i})$ and $r(u_k)=3k+1$.
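The run count in Lemma 8 can be checked directly from the stated form of the BWT. The sketch below builds that form as a list of tokens and counts maximal runs. It assumes that the symbols written with the macros `\asym`, `\bsym`, `\lsym_i`, and `\rsym_i` are pairwise distinct; the token names `a`, `b`, `l1`, `r1`, and so on are placeholders chosen for illustration. It does not compute a BWT from $u_k$, since that would require the alphabet order, which is not given here.

```python
def lemma8_bwt_tokens(k: int) -> list:
    """Token sequence a^{2k} (b l_1)...(b l_k) r_k r_1 ... r_{k-1} from Lemma 8."""
    tokens = ["a"] * (2 * k)
    for i in range(1, k + 1):
        tokens += ["b", f"l{i}"]
    tokens.append(f"r{k}")
    tokens += [f"r{i}" for i in range(1, k)]
    return tokens


def count_token_runs(tokens: list) -> int:
    """Number of maximal runs of equal adjacent tokens."""
    return sum(1 for i in range(len(tokens)) if i == 0 or tokens[i] != tokens[i - 1])


for k in range(1, 6):
    # One run for a^{2k}, 2k runs for the b/l_i block, k runs for the r_i block.
    assert count_token_runs(lemma8_bwt_tokens(k)) == 3 * k + 1
```

The sequence has $2k + 2k + 1 + (k-1) = 5k$ tokens, which matches the length of $u_3$ shown in Figure 1 (15 symbols for $k=3$).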

Figures (4)

  • Figure 1: Illustration of \ref{le:r_uk}, \ref{le:r_ukR}, and \ref{prop:add-sen-r-dol} on $u_3= \bsym \asym \lsym_1 \asym \rsym_1 \bsym \asym \lsym_2 \asym \rsym_2 \bsym \asym \lsym_3 \asym \rsym_3$. The BWT matrices of relevant strings are illustrated, with the prefixes of the rotations shown alongside the BWTs. The changes caused by appending $\dol$ are highlighted in blue.
  • Figure 2: Illustration of $T_p = G_1 \cdots G_{m_p}$ (left) and $T_p^R = G_{m_p}^R \cdots G_1^R$ (right) for $p=4$. Note that $\mathcal{A}_4 = a_4 a_3 a_2 a_1 \cdot a_3 a_2 a_1 \cdot a_2 a_1 \cdot a_1$, $\mathcal{B}_4 = b_1 \cdot b_2 b_1 \cdot b_3 b_2 b_1 \cdot b_4 b_3 b_2 b_1$, and $m_4 = 10$. The colored boundaries illustrate \ref{lem:lz-t-p} and \ref{lem:lz-t-p-r}.
  • Figure 3: Illustration of \ref{lem:v-odd-fib-rev-c}. Top: several factorizations of $w = F_k^R \c$ for odd $k$. Bottom: the sorted suffixes of $w$ (suffixes starting with $\a$, $\b$, and $\c$ are shown in three colors on the left; the ordinals indicate the order of the six phrases in the lex-parse of $w$ highlighted in gray).
  • Figure 4: Illustration of \ref{lem:v-w-sigma} and \ref{lem:v-w-sigma-r}: prefixes of sorted suffixes of $w_\sigma$ (left) and $w_\sigma^R$ (right). The ordinals indicate the order of the highlighted phrases for $\sigma=6$, where $w_6 = a_1 a_2 a_2 a_3 a_3 a_4 a_4 a_5 a_5 a_6 a_1 a_2 a_3 a_4 a_5 a_6$.

Theorems & Definitions (34)

  • Definition 1
  • Definition 2
  • Example 3
  • Definition 4
  • Definition 5
  • Example 6
  • Definition 7: $u_k$
  • Lemma 8
  • Lemma 9
  • Proposition 10
  • ...and 24 more