All-Pairs Suffix-Prefix on Fully Dynamic Set of Strings

Masaru Kikuchi; Shunsuke Inenaga

TL;DR

This paper addresses the all-pairs suffix-prefix (APSP) problem under dynamic settings, presenting an $O(n)$-space data structure that, for each newly arriving string $S_i$, computes both $\mathcal{F}_i$ and $\mathcal{B}_i$ in $O(|S_i| \log \sigma + i)$ time. The approach leverages a DAWG-based dynamic framework to update and query overlaps efficiently, with a suffix-tree-based extension to handle deletions in a fully dynamic setting, achieving amortized $O(|S_i| \log \sigma + k)$ per update where $k$ is the current set size. A separate static-APSP algorithm based on AC-automata and a compact prefix-trie provides a simple, linear-space baseline that matches static-state performance up to a $\log \sigma$ factor. Together, these results yield near-optimal dynamic algorithms for APSP, applicable to genome assembly and other string-processing domains, and open avenues for extensions to dynamic hierarchical overlap graphs. The work demonstrates how combining DAWGs, AC-automata, and suffix trees enables efficient real-time maintenance of suffix-prefix relationships in growing and shrinking string collections.

Abstract

The all-pairs suffix-prefix (APSP) problem is a classical problem in string processing which has important applications in bioinformatics. Given a set $\mathcal{S} = \{S_1, \ldots, S_k\}$ of $k$ strings, the APSP problem asks one to compute the longest suffix of $S_i$ that is a prefix of $S_j$ for all $k^2$ ordered pairs $\langle S_i, S_j \rangle$ of strings in $\mathcal{S}$. In this paper, we consider the dynamic version of the APSP problem that allows for insertions of new strings to the set of strings. Our objective is, each time a new string $S_i$ arrives to the current set $\mathcal{S}_{i-1} = \{S_1, \ldots, S_{i-1}\}$ of $i-1$ strings, to compute (1) the longest suffix of $S_i$ that is a prefix of $S_j$ and (2) the longest prefix of $S_i$ that is a suffix of $S_j$ for all $1 \leq j \leq i$. We propose an $O(n)$-space data structure which computes (1) and (2) in $O(|S_i| \log \sigma + i)$ time for each new given string $S_i$, where $n$ is the total length of the strings. Further, we show how to extend our methods to the fully dynamic version of the APSP problem allowing for both insertions and deletions of strings.
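
To make the two per-insertion queries (1) and (2) concrete, the following is a minimal brute-force sketch in Python. It is not the paper's data structure: it simply compares the new string against every stored string, so it is far slower than the stated $O(|S_i| \log \sigma + i)$ bound. The names `longest_overlap` and `NaiveDynamicAPSP` are hypothetical, and the sketch assumes that an overlap may be as long as the whole string (so the entry for $j = i$ is $|S_i|$).

```python
from typing import List, Tuple


def longest_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Assumption: the overlap may cover a whole string. Brute force, for
    reference only.
    """
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


class NaiveDynamicAPSP:
    """Reference implementation of the insertion-only problem statement."""

    def __init__(self) -> None:
        self.strings: List[str] = []

    def insert(self, s: str) -> Tuple[List[int], List[int]]:
        """Add `s` as S_i and answer queries (1) and (2) for all j <= i."""
        self.strings.append(s)
        # (1): longest suffix of S_i that is a prefix of S_j
        suffix_prefix = [longest_overlap(s, t) for t in self.strings]
        # (2): longest prefix of S_i that is a suffix of S_j
        prefix_suffix = [longest_overlap(t, s) for t in self.strings]
        return suffix_prefix, prefix_suffix


if __name__ == "__main__":
    apsp = NaiveDynamicAPSP()
    apsp.insert("abaa")
    print(apsp.insert("bab"))  # overlaps of "bab" with "abaa" and with itself
```

A brute-force reference like this is mainly useful for checking a faster implementation on small inputs.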

Paper Structure (20 sections, 11 theorems, 9 equations, 6 figures)

This paper contains 20 sections, 11 theorems, 9 equations, 6 figures.

Introduction
Preliminaries
Strings
All-Pairs Suffix-Prefix Overlap (APSP) Problems
APSP Problems on Static Sets
APSP on Dynamic Sets with Insertions
APSP on Fully Dynamic Sets with Insertions and Deletions
Tools
Tries and Compact Tries
Aho-Corasick Automata
Directed acyclic word graphs (DAWGs)
Suffix trees
Algorithm for Static-APSP
Algorithm for Dynamic APSP
Computing $\mathcal{F}_i$
...and 5 more sections

Key Result

Theorem 1

For a set $\mathcal{S}$ of strings of total length $n$, $\mathsf{AC}(\mathcal{S})$ can be built

Figures (6)

Figure 1: Illustrations of $\mathsf{AC}(\mathcal{S})$ (left) and $\mathsf{ComTrie}(\mathcal{S})$ (right) for the set $\mathcal{S} = \{\mathrm{abaa, abac, abb, abcb, bab, babaa, bb, bbaa, bbba}\}$ of strings. The bold solid arcs represent trie edges and the dashed arcs represent failure links. The nodes representing the strings in $\mathcal{S}$ are depicted by double-lined circles with the string IDs.
Figure 2: $\mathsf{DAWG}(\mathcal{S})$ for the same set $\mathcal{S} = \{\mathrm{abaa, abac, abb, abcb, bab, babaa, bb, bbaa, bbba}\}$ of strings as in Fig. 1. The induced tree consisting only of the double-lined arcs is $\mathsf{Trie}(\mathcal{S})$.
Figure 3: Illustration of the suffix links of $\mathsf{DAWG}(\mathcal{S})$ for the same set $\mathcal{S}$ of strings as in Fig. 2.
Figure 4: $\mathsf{STree}(\mathcal{S})$ for the same set $\mathcal{S} = \{\mathrm{abaa, abac, abb, abcb, bab, babaa, bb, bbaa, bbba}\}$ of strings as in Fig. 1. The induced tree consisting only of the double-lined arcs is a compacted version of $\mathsf{Trie}(\mathcal{S})$.
Figure 5: Illustration of the suffix links of $\mathsf{STree}(\mathcal{S})$ for the same set $\mathcal{S}$ of strings as in Fig. 4.
...and 1 more figures
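
The trie edges and failure links described for Figure 1 can be used to answer static APSP queries, and the sketch below illustrates one simple way to do so in Python. It is an assumption-laden illustration rather than the paper's algorithm: it does not build $\mathsf{ComTrie}(\mathcal{S})$, it takes time roughly proportional to $k$ times the total length instead of the linear-type bounds discussed in the paper, and the names `build_aho_corasick` and `static_apsp` are hypothetical. The idea is that the nodes on the failure-link chain of the node for $S_i$ spell exactly the suffixes of $S_i$ that are prefixes of some string in $\mathcal{S}$, so the answer for $\langle S_i, S_j \rangle$ is the depth of the deepest such node that is an ancestor of (or equal to) the node for $S_j$.

```python
from collections import deque
from typing import List, Tuple


def build_aho_corasick(strings: List[str]) -> Tuple[List[int], List[int], List[int], List[int]]:
    """Build a trie with failure links; return (fail, depth, parent, terminal)."""
    children = [{}]            # children[v][c] = child of node v via character c
    fail, depth, parent = [0], [0], [0]
    terminal = []              # terminal[i] = node spelling strings[i]
    for s in strings:
        v = 0
        for c in s:
            if c not in children[v]:
                children.append({})
                fail.append(0)
                depth.append(depth[v] + 1)
                parent.append(v)
                children[v][c] = len(children) - 1
            v = children[v][c]
        terminal.append(v)
    # Breadth-first computation of failure links (depth-1 nodes fail to the root).
    queue = deque(children[0].values())
    while queue:
        v = queue.popleft()
        for c, u in children[v].items():
            f = fail[v]
            while f and c not in children[f]:
                f = fail[f]
            fail[u] = children[f].get(c, 0)
            queue.append(u)
    return fail, depth, parent, terminal


def static_apsp(strings: List[str]) -> List[List[int]]:
    """table[i][j] = length of the longest suffix of S_i that is a prefix of S_j.

    Assumption: an overlap may cover a whole string, so table[i][i] = |S_i|.
    """
    fail, depth, parent, terminal = build_aho_corasick(strings)
    k = len(strings)
    table = [[0] * k for _ in range(k)]
    for i in range(k):
        # Collect the failure-link chain of S_i's node, including the node itself.
        chain = set()
        v = terminal[i]
        while v:
            chain.add(v)
            v = fail[v]
        for j in range(k):
            # Climb from S_j's node to its deepest ancestor on the chain.
            u = terminal[j]
            while u and u not in chain:
                u = parent[u]
            table[i][j] = depth[u]
    return table


if __name__ == "__main__":
    S = ["abaa", "abac", "abb", "abcb", "bab", "babaa", "bb", "bbaa", "bbba"]
    for row in static_apsp(S):
        print(row)
```

The example set is the one used in the figure captions above; on small inputs the table can be cross-checked against a direct string comparison of every pair.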

Theorems & Definitions (14)

Theorem 1: DoriL06, Aho1975StringMatching
Theorem 2: Blumer1987
Theorem 3: Weiner73, Ukkonen95, TakagiIABH20
Theorem 4
proof
Corollary 1
proof
Theorem 5
Corollary 2
Lemma 1
...and 4 more
