The CDAWG Index and Pattern Matching on Grammar-Compressed Strings

Alan M. Cleary; Joseph Winjum; Jordan Dood; Shunsuke Inenaga

TL;DR

The paper addresses pattern matching on grammar-compressed strings by using the CDAWG as an index for the straight-line grammars (SLGs) that generate the text. The approach decouples random access from pattern matching, so any SLG that supports random access can be indexed by the CDAWG. Pattern matching then takes $\mathcal{O}(\text{ra}(m)+\text{occ})$ time with $\mathcal{O}(\text{er}(T))$ additional space, and the index remains compatible with various random-access strategies. Empirical evaluation on data from the Pizza&Chili corpus and the Yeast Population Reference Panel demonstrates state-of-the-art runtimes and shows that the grammars produced are smaller than the number of right-extensions, placing the CDAWG within the best known $\mathcal{O}(\text{er}(T))$ space bound; the study also analyzes dataset-specific challenges, notably in DNA-like sequences. The work suggests that CDAWG-based analyses can be extended to a broad class of grammar-compressed strings, offering application-specific trade-offs and paving the way for further generalization to additional SLG forms.

Abstract

The compact directed acyclic word graph (CDAWG) is the minimal compact automaton that recognizes all the suffixes of a string. Classically the CDAWG has been implemented as an index of the string it recognizes, requiring $\mathcal{O}(n)$ space for a copy of the string $T$ being indexed, where $n=|T|$. In this work, we propose using the CDAWG as an index for grammar-compressed strings. While this enables all analyses supported by the CDAWG on any grammar-compressed string, in this work we specifically consider pattern matching. Using the CDAWG index, pattern matching can be performed on any grammar-compressed string in $\mathcal{O}(\text{ra}(m)+\text{occ})$ time while requiring only $\mathcal{O}(\text{er}(T))$ additional space, where $m$ is the length of the pattern, $\text{ra}(m)$ is the grammar random access time, $\text{occ}$ is the number of occurrences of the pattern in $T$, and $\text{er}(T)$ is the number of right-extensions of the maximal repeats in $T$. Our experiments show that even when using a naïve random access algorithm, the CDAWG index achieves state-of-the-art run-time performance for pattern matching on grammar-compressed strings. Additionally, we find that all of the grammars computed for our experiments are smaller than the number of right-extensions in the string they produce and, thus, their CDAWGs are within the best known $\mathcal{O}(\text{er}(T))$ asymptotic space bound.
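To make the role of $\text{ra}(m)$ concrete, the following is a minimal sketch of naive random access on a straight-line grammar. It is not the authors' implementation. The grammar `RULES` is a hypothetical SLG chosen here to generate the Figure 1 string, and the names `expansion_length`, `extract`, and `matches_at` are illustrative assumptions. The sketch only shows how a substring of $T$ can be read, and a pattern compared against it, without decompressing the whole string.

```python
from functools import lru_cache

# Hypothetical SLG (an assumption, not taken from the paper) that generates
# T = "AGAGCGAGAGCGCGC$". Each nonterminal has exactly one rule; symbols
# that are not keys of RULES are terminals (single characters).
RULES = {
    "X": ("A", "G"),                  # X -> AG
    "Y": ("C", "G"),                  # Y -> CG
    "Z": ("X", "X", "Y"),             # Z -> AGAGCG
    "S": ("Z", "Z", "Y", "C", "$"),   # S -> T
}


@lru_cache(maxsize=None)
def expansion_length(symbol):
    """Length of the string that `symbol` expands to."""
    if symbol not in RULES:
        return 1
    return sum(expansion_length(child) for child in RULES[symbol])


def extract(symbol, start, length):
    """Return up to `length` characters of the expansion of `symbol`,
    beginning at offset `start`, by descending only into the rules that
    overlap the requested range."""
    out = []

    def walk(sym, skip, need):
        # Emit at most `need` characters of sym's expansion after skipping
        # `skip` characters; return the number of characters emitted.
        if need <= 0:
            return 0
        if sym not in RULES:
            if skip == 0:
                out.append(sym)
                return 1
            return 0
        emitted = 0
        for child in RULES[sym]:
            size = expansion_length(child)
            if skip >= size:
                skip -= size          # child lies entirely before the range
                continue
            emitted += walk(child, skip, need - emitted)
            skip = 0
            if emitted >= need:
                break
        return emitted

    walk(symbol, start, length)
    return "".join(out)


def matches_at(pattern, position, start_symbol="S"):
    """Check whether `pattern` occurs in T at `position` via random access."""
    return extract(start_symbol, position, len(pattern)) == pattern


print(extract("S", 4, 6))      # prints "CGAGAG"
print(matches_at("GCGC", 9))   # prints "True"
```

A CDAWG edge label is conventionally stored as a position and a length in $T$ rather than as an explicit string, so a routine like `extract` is the kind of operation an index over a grammar would use to read labels. How the paper organizes these reads is not specified in this summary.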

Paper Structure (19 sections, 1 figure, 2 tables)

This paper contains 19 sections, 1 figure, 2 tables.

Introduction
Preliminaries
Strings
CDAWGs
Context-Free Grammars
Random Access and Pattern Matching
Related work
The CDAWG Index and Grammar-Compressed Strings
Algorithms and Data Structures
SLG Representation
Random Access
CDAWG Index Construction
SLG Pattern Matching
Results
Implementation
...and 4 more sections

Figures (1)

Figure 1: The CDAWG and a straight-line grammar for string $T = \text{AGAGCGAGAGCGCGC}\$$.
