The CDAWG Index and Pattern Matching on Grammar-Compressed Strings

Alan M. Cleary, Joseph Winjum, Jordan Dood, Shunsuke Inenaga

TL;DR

The paper addresses pattern matching on grammar-compressed strings by using the CDAWG as an index for straight-line grammars (SLGs) that generate the text. The approach decouples random access from pattern matching, so any SLG that supports random access can be indexed by the CDAWG, whatever random-access strategy it uses. Pattern matching then takes $\mathcal{O}(\text{ra}(m)+\text{occ})$ time with $\mathcal{O}(\text{er}(T))$ additional space. Experiments on data from the Pizza&Chili corpus and the Yeast Population Reference Panel show state-of-the-art run times. They also show that every grammar computed is smaller than the number of right-extensions of the string it produces, which places the CDAWGs within the best known $\mathcal{O}(\text{er}(T))$ space bound. The study also analyzes dataset-specific challenges, notably in DNA-like sequences. The work suggests that CDAWG-based analyses can be extended to a broad class of grammar-compressed strings, with application-specific trade-offs, and that the approach could be generalized to additional SLG forms.

Abstract

The compact directed acyclic word graph (CDAWG) is the minimal compact automaton that recognizes all the suffixes of a string. Classically, the CDAWG has been implemented as an index of the string it recognizes, requiring $\mathcal{O}(n)$ space for a copy of the string $T$ being indexed, where $n=|T|$. In this work, we propose using the CDAWG as an index for grammar-compressed strings. While this enables all analyses supported by the CDAWG on any grammar-compressed string, in this work we specifically consider pattern matching. Using the CDAWG index, pattern matching can be performed on any grammar-compressed string in $\mathcal{O}(\text{ra}(m)+\text{occ})$ time while requiring only $\mathcal{O}(\text{er}(T))$ additional space, where $m$ is the length of the pattern, $\text{ra}(m)$ is the grammar random access time, $\text{occ}$ is the number of occurrences of the pattern in $T$, and $\text{er}(T)$ is the number of right-extensions of the maximal repeats in $T$. Our experiments show that even when using a naïve random access algorithm, the CDAWG index achieves state-of-the-art run-time performance for pattern matching on grammar-compressed strings. Additionally, we find that all of the grammars computed for our experiments are smaller than the number of right-extensions in the string they produce and, thus, their CDAWGs are within the best known $\mathcal{O}(\text{er}(T))$ asymptotic space bound.
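To make the role of grammar random access concrete, here is a minimal sketch of a naïve random access routine on a straight-line grammar. The grammar `RULES`, the start symbol `S`, and the helper names `expansion_length`, `access`, `extract`, and `occurs_at` are all hypothetical: the grammar was written by hand to generate the string from Figure 1 and is not the grammar shown in the paper, and the routine is not the authors' implementation. It assumes that, as in classical CDAWG implementations, an index reads text characters by position, so replacing the stored copy of $T$ with calls to `access` is enough to compare a pattern against the text.

```python
from functools import lru_cache

# Hypothetical straight-line grammar (NOT the one in the paper's figure).
# Every nonterminal derives exactly two symbols; any symbol that is not a
# key of RULES is treated as a terminal character.
RULES = {
    "X1": ("A", "G"),     # AG
    "X2": ("C", "G"),     # CG
    "X3": ("X1", "X1"),   # AGAG
    "X4": ("X3", "X2"),   # AGAGCG
    "X5": ("X4", "X4"),   # AGAGCGAGAGCG
    "X6": ("X2", "C"),    # CGC
    "X7": ("X5", "X6"),   # AGAGCGAGAGCGCGC
    "S": ("X7", "$"),     # AGAGCGAGAGCGCGC$
}


@lru_cache(maxsize=None)
def expansion_length(symbol):
    """Length of the string derived from `symbol`."""
    if symbol not in RULES:
        return 1
    left, right = RULES[symbol]
    return expansion_length(left) + expansion_length(right)


def access(i, symbol="S"):
    """Return T[i] by walking down the parse tree, without expanding T."""
    if not 0 <= i < expansion_length(symbol):
        raise IndexError(i)
    while symbol in RULES:
        left, right = RULES[symbol]
        left_len = expansion_length(left)
        if i < left_len:
            symbol = left
        else:
            i -= left_len
            symbol = right
    return symbol


def extract(start, length):
    """Naive substring access: one root-to-leaf walk per character."""
    return "".join(access(start + k) for k in range(length))


def occurs_at(pattern, start):
    """Check a candidate occurrence using random access only."""
    if start < 0 or start + len(pattern) > expansion_length("S"):
        return False
    return all(access(start + k) == c for k, c in enumerate(pattern))


print(extract(0, expansion_length("S")))
print(occurs_at("GCG", 3))
```

Each call to `access` costs time proportional to the height of the parse tree, so this sketch stands in for the $\text{ra}(m)$ term only in the loosest sense; any faster random-access structure could replace it without changing `occurs_at`.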

Paper Structure

This paper contains 19 sections, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: The CDAWG and a straight-line grammar for string $T = \text{AGAGCGAGAGCGCGC}\$$.
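The quantity $\text{er}(T)$ from the abstract can be illustrated on the string of Figure 1 with a brute-force enumeration. The sketch below is an assumption-laden illustration, not the paper's method: the function name `maximal_repeats` is hypothetical, only non-empty substrings are considered, and a substring counts as a maximal repeat when it occurs at least twice and its occurrences are preceded by at least two distinct characters and followed by at least two distinct characters, with the start and end of the string each treated as a distinct virtual character. Conventions for the empty string and for string boundaries differ between papers, so the total printed here may differ from the edge count of the CDAWG in the figure.

```python
from collections import defaultdict


def maximal_repeats(text):
    """Map each maximal repeat of `text` to its set of right-extension characters.

    Brute force over all non-empty substrings; intended for tiny inputs only.
    """
    occurrences = defaultdict(int)
    left_chars = defaultdict(set)
    right_chars = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = text[i:j]
            occurrences[w] += 1
            # None stands for the virtual character before/after the string.
            left_chars[w].add(text[i - 1] if i > 0 else None)
            right_chars[w].add(text[j] if j < n else None)
    return {
        w: right_chars[w]
        for w in occurrences
        if occurrences[w] >= 2
        and len(left_chars[w]) >= 2
        and len(right_chars[w]) >= 2
    }


T = "AGAGCGAGAGCGCGC$"
repeats = maximal_repeats(T)
er = sum(len(extensions) for extensions in repeats.values())

for w in sorted(repeats, key=lambda w: (len(w), w)):
    print(w, sorted(repeats[w], key=str))
print("right-extensions under these assumptions:", er)
```

For example, `G` is reported as a maximal repeat because it is preceded by both `A` and `C` and followed by both `A` and `C`, whereas `CG` is not, since every occurrence of `CG` is preceded by `G`.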