Relating Left and Right Extensions of Maximal Repeats

Shunsuke Inenaga, Dmitry Kosolobov

TL;DR

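For a string $T$ of length $n$, the ratio between the number of left extensions and the number of right extensions of maximal repeats (the sizes of the CDAWGs of the reversed string and of $T$, respectively) is proved to be in $O(\sqrt{n})$, matching the known $\Omega(\sqrt{n})$ lower bound. An asymptotically tight alphabet-dependent bound $\min\{\frac{2n}{\sigma}, \sigma\}$ is also established.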

Abstract

The compact directed acyclic word graph (CDAWG) of a string $T$ is an index occupying $O(\mathsf{e})$ space, where $\mathsf{e}$ is the number of right extensions of maximal repeats in $T$. For highly repetitive datasets, the measure $\mathsf{e}$ is typically small compared to the length $n$ of $T$ and, thus, the CDAWG serves as a compressed index. Unlike other compressibility measures (such as LZ77, string attractors, BWT runs, etc.), $\mathsf{e}$ is very unstable with respect to reversals: the CDAWG of the reversed string $\overset{{}_{\leftarrow}}{T} = T[n] \cdots T[2] T[1]$ has size $O(\overset{{}_{\leftarrow}}{\mathsf{e}})$, where $\overset{{}_{\leftarrow}}{\mathsf{e}}$ is the number of left extensions of maximal repeats in $T$, and there are strings $T$ with $\frac{\overset{{}_{\leftarrow}}{\mathsf{e}}}{\mathsf{e}} \in \Omega(\sqrt{n})$. In this note, we prove that this lower bound is tight: $\frac{\overset{{}_{\leftarrow}}{\mathsf{e}}}{\mathsf{e}} \in O(\sqrt{n})$. Furthermore, given the alphabet size $\sigma$, we establish the alphabet-dependent bound $\frac{\overset{{}_{\leftarrow}}{\mathsf{e}}}{\mathsf{e}} \le \min\{\frac{2n}{\sigma}, \sigma\}$ and we show that it is asymptotically tight.
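
To make the two measures concrete, the following brute-force sketch counts right and left extensions of maximal repeats for a short string. It rests on assumptions not taken from the paper: a substring $w$ (including the empty string) is treated as a maximal repeat if it occurs at least twice and every one-character extension $aw$ or $wa$ occurs strictly fewer times, and no sentinel characters are added to $T$. The paper's exact conventions may differ, so the absolute counts here are illustrative only. The function names are hypothetical.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Number of (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return len(text) + 1  # the empty string occurs at every position
    last_start = len(text) - len(pattern)
    return sum(text.startswith(pattern, i) for i in range(last_start + 1))


def maximal_repeats(text: str) -> set:
    """Substrings occurring >= 2 times whose every one-character
    left or right extension occurs strictly fewer times (assumed definition)."""
    alphabet = set(text)
    candidates = {text[i:j] for i in range(len(text))
                  for j in range(i, len(text) + 1)}
    result = set()
    for w in candidates:
        occ = count_occurrences(text, w)
        if occ < 2:
            continue
        if all(count_occurrences(text, a + w) < occ
               and count_occurrences(text, w + a) < occ
               for a in alphabet):
            result.add(w)
    return result


def extension_counts(text: str) -> tuple:
    """Return (right, left): the number of pairs (w, c) such that w is a
    maximal repeat and wc (respectively cw) is a substring of text."""
    alphabet = set(text)
    repeats = maximal_repeats(text)
    right = sum(1 for w in repeats for c in alphabet if w + c in text)
    left = sum(1 for w in repeats for c in alphabet if c + w in text)
    return right, left


if __name__ == "__main__":
    t = "abcab"
    right, left = extension_counts(t)
    print(f"T = {t!r}: right extensions = {right}, left extensions = {left}")
```

Because the assumed definition is symmetric under reversal, the left-extension count of $T$ equals the right-extension count of the reversed string, which is why the abstract can speak of the CDAWG of $\overset{{}_{\leftarrow}}{T}$ in terms of left extensions in $T$. The sketch runs in polynomial time with a high exponent and is meant only for strings of a few dozen characters.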

Paper Structure

This paper contains 4 sections, 2 theorems, 2 equations.

Key Result

Theorem 1

For any string $T$ of length $n$, we have $\frac{\overset{{}_{\leftarrow}}{\mathsf{e}}}{\mathsf{e}} \le \min\{\frac{2n}{\sigma}, \sigma\}$, where $\sigma$ is the alphabet size and $\mathsf{e}$ and $\overset{{}_{\leftarrow}}{\mathsf{e}}$ …
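
One way to see how this alphabet-dependent bound relates to the $O(\sqrt{n})$ bound stated in the abstract is the short calculation below. It is a sketch added for illustration, not quoted from the paper; it uses only the fact that the minimum of two positive numbers is at most their geometric mean.

$$
\min\left\{\frac{2n}{\sigma},\ \sigma\right\} \;\le\; \sqrt{\frac{2n}{\sigma} \cdot \sigma} \;=\; \sqrt{2n}
$$

Hence the bound $\min\{\frac{2n}{\sigma}, \sigma\}$ never exceeds $\sqrt{2n}$ for any alphabet size $\sigma$, and the two terms coincide exactly when $\sigma = \sqrt{2n}$. For much smaller or much larger alphabets, the alphabet-dependent bound is strictly smaller than $\sqrt{2n}$.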

Theorems & Definitions (5)

  • Example
  • Theorem 1
  • proof
  • Theorem 2
  • proof