Encoding Co-Lex Orders of Finite-State Automata in Linear Space

Ruben Becker, Nicola Cotumaccio, Sung-Hwan Kim, Nicola Prezza, Carlo Tosoni

TL;DR

The paper tackles encoding the maximum co-lex (CFS) order for forward-stable NFAs in linear space, addressing a key obstacle to near-linear-time index construction for NFAs. It introduces a linear-space data structure that stores per-state left-minimal infimum walks, right-maximal supremum walks, and associated conflict-encoding integers, together with a co-lex extension, to support $O(n)$-time queries on a graph with $n$ states and $m$ transitions. A central contribution is the constructive proof of leftmost walk existence via the Forward Visit algorithm, which combines DFS to find a cycle with BFS-based propagation to all nodes, enabling a compact representation of leftmost paths. This work paves the way for near-linear-time computation of CFS orders and, consequently, faster, scalable indices for pattern finding on NFAs, with implications for large graphs such as pangenome graphs.

Abstract

The Burrows-Wheeler transform (BWT) is a string transformation that enhances string indexing and compressibility. Cotumaccio and Prezza [SODA '21] extended this transformation to nondeterministic finite automata (NFAs) through co-lexicographic partial orders, i.e., by sorting the states of an NFA according to the co-lexicographic order of the strings reaching them. As the BWT of an NFA shares many properties with its original string variant, the transformation can be used to implement indices for locating specific patterns on the NFA itself. The efficiency of the resulting index is influenced by the width of the partial order on the states: the smaller the width, the faster the index. The most efficient index for arbitrary NFAs currently known in the literature is based on the coarsest forward-stable co-lex (CFS) order of Becker et al. [SPIRE '24]. In this paper, we prove that this CFS order can be encoded in space linear in the number of states of the automaton. The importance of this result stems from the fact that encoding such an order in linear space is a big first step towards building the index based on this order in near-linear time, which is the biggest open research question in this context. The most efficient algorithms currently known for this task run in time quadratic in the number of transitions of the NFA and are thus infeasible to run on very large graphs (e.g., pangenome graphs). At this point, a near-linear time algorithm is known only for the simpler case of deterministic automata [Becker et al., ESA '23] and, in fact, this algorithmic result was enabled by a linear-space encoding for deterministic automata [Kim et al., CPM '23].
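
To make "sorting states by the co-lexicographic order of the strings reaching them" concrete, here is a minimal Python sketch on a toy acyclic NFA. The automaton, the helper names (`colex_key`, `strings_reaching`, `precedes`), and the simplified test "every string reaching u is co-lex at most every string reaching v" are illustrative assumptions, not the paper's formal definition of a co-lex order.

```python
from collections import defaultdict

def colex_key(s: str) -> str:
    """Co-lex order compares strings from their last character backwards,
    which is the same as comparing the reversed strings lexicographically."""
    return s[::-1]

words = ["ab", "b", "aa", "ba", "a"]
print(sorted(words, key=colex_key))  # ['a', 'aa', 'ba', 'b', 'ab']

def strings_reaching(edges, source):
    """Collect the strings labelling walks from `source` to each state.
    Only terminates on acyclic automata (assumption of this toy example)."""
    out = defaultdict(list)
    for u, a, v in edges:
        out[u].append((a, v))
    reach = defaultdict(set)
    stack = [(source, "")]
    while stack:
        state, word = stack.pop()
        reach[state].add(word)
        for a, nxt in out[state]:
            stack.append((nxt, word + a))
    return reach

def precedes(reach, u, v) -> bool:
    """Simplified check: every string reaching u is co-lex <= every string reaching v."""
    return max(map(colex_key, reach[u])) <= min(map(colex_key, reach[v]))

# Hypothetical toy NFA given as (source state, label, target state) triples.
edges = [(0, "a", 1), (0, "b", 2), (1, "a", 3), (2, "a", 3)]
reach = strings_reaching(edges, 0)
print(precedes(reach, 1, 3), precedes(reach, 3, 2), precedes(reach, 2, 3))
```

In this toy case the three non-source states happen to be totally ordered (1, then 3, then 2). In general only a partial order is obtained, and its width is what governs index efficiency.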

Paper Structure

This paper contains 14 sections, 11 theorems, 2 equations, 3 figures, 1 algorithm.

Key Result

Theorem 1.2

Given a forward-stable NFA with $n$ states, there exists a data structure for Problem main_pr taking $O(n)$ space and supporting queries in $O(n)$ time.
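
As a rough space count, consider the components that the Figure 2 caption lists for the data structure: the co-lex extension $\le$, one infimum and one supremum walk per state, and the integers $\phi$ and $\gamma$. The tally below assumes, beyond what the theorem states, that each state stores its rank in $\le$, one predecessor pointer per stored walk, and the two integers, each in one machine word:

```latex
\underbrace{n}_{\text{ranks in } \le}
\;+\; \underbrace{2n}_{\text{predecessor pointers for } P_u^{\inf},\ P_u^{\sup}}
\;+\; \underbrace{2n}_{\phi,\ \gamma}
\;=\; 5n \;=\; O(n) \ \text{words}
```

For comparison, storing the relation explicitly as a table over all pairs of states would take $n^2$ bits.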

Figures (3)

  • Figure 1: Let $\leq_{FS}$ and $\leq$ be the maximum co-lex order (see Definition \ref{def:3:colex_order}) and a co-lex extension (see Definition \ref{linear_ext}) of a forward-stable NFA $\mathcal{A}$, respectively. Let $u$ and $v$ be any two states in $\mathcal{A}$. Denote with $P_{u}^{\sup}=(u_{i})_{i\geq1}$ a supremum right-maximal walk to the state $u$ and with $P_{v}^{\inf}=(v_{i})_{i\geq1}$ an infimum left-minimal walk to $v$ (see Def. \ref{def:3:inf_sup_walks} and \ref{def:6:left_min_rig_max}). The figure shows the decision tree representing all possible cases that may arise when determining whether $u \leq_{FS} v$. Here, $j$ is the smallest integer such that $v_{j} < u_{j}$, while $j'$ is the smallest integer such that $u_{j'} = v_{j'}$. Functions $\phi^{j'}$ and $\gamma^{j'}$ represent the deepest states in infimum/supremum conflict with the walks $P_u^{\sup}$ and $P_v^{\inf}$.
  • Figure 2: Consider the forward-stable NFA $\mathcal{A}$ in Figure (a). Each state is assigned an integer $i$ indicating its position in the co-lex extension $\leq$. We denote by $u_{i}$ the $i$-th state according to $\leq$. Figures (b) and (c) show the NFAs encoding a left-minimal infimum walk and a right-maximal supremum walk, respectively, for each state. The table on the right shows for each state $u$ the values of $\inf I_{u}$, $\sup I_{u}$, $\phi(u,P_{u}^{\inf})$, and $\gamma(u,P_{u}^{\sup})$, where $P_{u}^{\inf}$ and $P_{u}^{\sup}$ are the walks shown in Figures (b) and (c). Our data structure comprises $\le$, the walks in Figures (b) and (c), and the two columns $\phi$, $\gamma$ from the table. We sketch the four cases that arise when determining whether $u <_{FS} v$ holds, assuming $u < v$. (i) By Lemma \ref{lem:5:sup_inf_co_lex}, since $\sup I_{u_3} \leq \inf I_{u_5}$, it follows that $u_3 <_{FS} u_{5}$. (ii) Consider $P_{u_{2}}^{\sup} = u_2,u_{13},\ldots$ and $P_{u_6}^{\inf} = u_6,u_9,\ldots$. Since $\sup I_{u_2} > \inf I_{u_6}$, $u_2 < u_6$, and $u_{13} > u_9$, Lemma \ref{lem:5:sup_inf_vs_walks} gives $\neg(u_{2} <_{FS} u_{6})$. (iii) Consider now $P_{u_4}^{\sup} = u_{4},u_{6},\ldots$ and $P_{u_7}^{\inf} = u_{7},u_{6},\ldots$. Since $\sup I_{u_4} > \inf I_{u_7}$, $u_{4} < u_{7}$, $u_{6} = u_6$, and $\max\{ \gamma^2(u_4, P_{u_4}^{\sup}), \phi^2(u_7, P_{u_7}^{\inf}) \} = 1 < 2$, we conclude by Lemma \ref{lem:6:main_lemma} that $u_4 <_{FS} u_7$. (iv) Finally, consider $P_{u_{10}}^{\sup} = u_{10},u_{4},u_{6},\ldots$ and $P_{u_{11}}^{\inf} = u_{11},u_{7},u_{6},\ldots$. Since $\sup I_{u_{10}} > \inf I_{u_{11}}$, $u_{10} < u_{11}$, $u_{4} < u_{7}$, $u_{6} = u_{6}$, and $\max\{\gamma^3(u_{10}, P_{u_{10}}^{\sup}), \phi^3(u_{11}, P_{u_{11}}^{\inf}) \} = 26 \geq 3$, we conclude by Lemma \ref{lem:6:main_lemma} that $\neg(u_{10} <_{FS} u_{11})$.
  • Figure 3: (a) A directed graph $G = (V,E)$ and a total order $\leq$ over $V$ represented by the integer names of nodes. (b) In green the first cycle $C$ that is found by the DFS of Algorithm \ref{alg:6:left}, if we start a DFS from node $2$; in red and blue the subgraphs $G_{L}$ and $G_{R}$ corresponding to $C$, respectively. Here, $L = \{3, 4\}$ and $R = \{11, 13\}$. (c) Leftmost walks represented by $p$ (indicated by the shown edges).
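
The case analysis in the Figure 2 caption can be replayed with a small Python sketch. It assumes that each stored walk is encoded by one predecessor pointer per state (consistent with the walk prefixes quoted in the caption, but still an assumption), and it takes the conflict value $\max\{\gamma^{j'}, \phi^{j'}\}$ as a given number, since the captions do not say how it is derived from the stored $\phi$ and $\gamma$. The names `unroll`, `classify`, `SUP_PRED`, and `INF_PRED` are hypothetical, and this is a reading of the caption rather than the paper's query algorithm.

```python
from typing import Dict, List, Optional

def unroll(pred: Dict[int, int], start: int, limit: int) -> List[int]:
    """Return the first states of the walk start, pred[start], pred[pred[start]], ...
    Stops after `limit` states or when a pointer is missing from the partial table."""
    walk = [start]
    while len(walk) < limit and walk[-1] in pred:
        walk.append(pred[walk[-1]])
    return walk

# Pointers read off the walk prefixes quoted in the Figure 2 caption. The integer i
# stands for state u_i, i.e. its position in the co-lex extension. Entries that the
# caption elides with "..." are deliberately left out.
SUP_PRED = {2: 13, 4: 6, 10: 4}   # prefixes of supremum walks P_u^sup
INF_PRED = {6: 9, 7: 6, 11: 7}    # prefixes of infimum walks P_v^inf

def classify(sup_walk: List[int], inf_walk: List[int],
             conflict: Optional[int] = None) -> str:
    """Scan both walks in lockstep, assuming u < v in the co-lex extension."""
    for depth, (a, b) in enumerate(zip(sup_walk, inf_walk), start=1):
        if b < a:        # smallest j with v_j < u_j comes first: case (ii)
            return "not u <_FS v"
        if a == b:       # smallest j' with u_j' = v_j' comes first: cases (iii), (iv)
            if conflict is None:
                raise ValueError("walks meet; a conflict value is required")
            return "u <_FS v" if conflict < depth else "not u <_FS v"
    return "undecided on this prefix"

# Cases (ii), (iii), (iv) of the caption; conflict values 1 and 26 are quoted from it.
print(classify(unroll(SUP_PRED, 2, 4), unroll(INF_PRED, 6, 4)))                 # not u <_FS v
print(classify(unroll(SUP_PRED, 4, 4), unroll(INF_PRED, 7, 4), conflict=1))     # u <_FS v
print(classify(unroll(SUP_PRED, 10, 4), unroll(INF_PRED, 11, 4), conflict=26))  # not u <_FS v
```

Case (i) of the caption is decided by comparing $\sup I_u$ with $\inf I_v$ directly, which this sketch does not model; on the short prefixes available here such a pair simply comes back as undecided.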

Theorems & Definitions (35)

  • Theorem 1.2
  • Definition 1.3: Infimum and supremum strings
  • Definition 1.4: Infimum and supremum walks
  • Definition 1.5: Co-lex order
  • proof
  • Definition 1.7: Preceding pairs
  • Corollary 1.8
  • proof
  • Definition 1.9: Co-lex extension
  • Lemma 1.9
  • ...and 25 more