On the Complexity of Computing the Co-lexicographic Width of a Regular Language

Ruben Becker; Davide Cenzato; Sung-Hwan Kim; Tomasz Kociumaka; Bojana Kodric; Alberto Policriti; Nicola Prezza

TL;DR

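Deciding whether the deterministic co-lex width of the language recognized by a given minimum DFA of size $m$ is strictly smaller than an integer $p\ge 2$ takes $O(m^p)$ time, and a conditional lower bound based on the Strong Exponential Time Hypothesis matches this upper bound. The problem is known to be PSPACE-complete when the input is an NFA (D'Agostino et al., Theoretical Computer Science 2023); thus, together with that result, this paper essentially settles the complexity of the problem.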

Abstract

Co-lex partial orders were recently introduced in (Cotumaccio et al., SODA 2021 and JACM 2023) as a powerful tool to index finite state automata, with applications to regular expression matching. They generalize Wheeler orders (Gagie et al., Theoretical Computer Science 2017) and naturally reflect the co-lexicographic order of the strings labeling source-to-node paths in the automaton. Briefly, the co-lex width $p$ of a finite-state automaton measures how sortable its states are with respect to the co-lex order among the strings they accept. Automata of co-lex width $p$ can be compressed to $O(\log p)$ bits per edge and admit regular expression matching algorithms running in time proportional to $p^2$ per matched character. The deterministic co-lex width of a regular language $\mathcal L$ is the smallest width of such a co-lex order, among all DFAs recognizing $\mathcal L$. Since languages of small co-lex width admit efficient solutions to automata compression and pattern matching, computing the co-lex width of a language is relevant in these applications. The paper introducing co-lex orders determined that the deterministic co-lex width $p$ of a language $\mathcal L$ can be computed in time proportional to $m^{O(p)}$, given as input any DFA $\mathcal A$ for $\mathcal L$, of size (number of transitions) $m =|\mathcal A|$. In this paper, using new techniques, we show that it is possible to decide in $O(m^p)$ time if the deterministic co-lex width of the language recognized by a given minimum DFA is strictly smaller than some integer $p\ge 2$. We complement this upper bound with a matching conditional lower bound based on the Strong Exponential Time Hypothesis. The problem is known to be PSPACE-complete when the input is an NFA (D'Agostino et al., Theoretical Computer Science 2023); thus, together with that result, our paper essentially settles the complexity of the problem.
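As a small illustration of the co-lexicographic order mentioned in the abstract, the sketch below sorts a few strings by comparing them right to left, which is the same as comparing their reversals lexicographically. The helper names `colex_key` and `colex_sorted` are hypothetical and chosen only for this example; the sketch orders plain strings and does not compute the co-lex width of an automaton.

```python
from typing import Iterable, List


def colex_key(s: str) -> str:
    """Sort key for co-lex order: the reversed string.

    Comparing reversals lexicographically compares the original
    strings character by character starting from their last character.
    """
    return s[::-1]


def colex_sorted(strings: Iterable[str]) -> List[str]:
    """Return the strings sorted in co-lexicographic order."""
    return sorted(strings, key=colex_key)


if __name__ == "__main__":
    words = ["ab", "ba", "b", "aa", "a"]
    # Reversals are "ba", "ab", "b", "aa", "a"; sorting those gives
    # "a" < "aa" < "ab" < "b" < "ba".
    print(colex_sorted(words))  # ['a', 'aa', 'ba', 'b', 'ab']
```

Strings ending in `a` come before strings ending in `b`, regardless of how they start. This is the sense in which co-lex orders on states reflect the strings labeling source-to-node paths.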

Paper Structure (32 sections, 14 theorems, 7 equations, 7 figures, 1 table)

This paper contains 32 sections, 14 theorems, 7 equations, 7 figures, 1 table.

Introduction
Preliminaries, Problems, and State of the Art
Model of computation
Intervals and Strings
Randomization Techniques
DFAs, Wheeler DFAs, and Co-Lex Width
Computational Problems and State of The Art
Entanglement of a Regular Language
A new Characterization of the Deterministic Co-Lex Width
Algorithms for the DFADetWidth Problem
A Simple Optimal-Time Algorithm
Computing the intervals $\mathcal{I}(u)$
Testing acyclicity of $\mathcal{B}$
Optimized Algorithm
Algorithm description
...and 17 more sections

Key Result

lemma 1

Given any DFA $\mathcal{A} = (Q, \Sigma, \delta, s, F)$, the co-lex order $<$ such that $\mathop{\mathrm{width}}\nolimits(\mathcal{A}) = \mathop{\mathrm{width}}\nolimits(<)$ is such that, for any two states $u,v\in Q$:

Figures (7)

Figure 1: Interval representation of infima and suprema strings of a (minimum) DFA.
Figure 2: DFA $\mathcal{A}'$ with $\mathop{\mathrm{width}}\nolimits(\mathcal{A}') = 3$ and $\mathcal{L}(\mathcal{A}')=\mathcal{L}(\mathcal{A})$, where $\mathcal{A}$ is the DFA in Figure \ref{fig:dfawidth}; $\mathcal{A}'$ is a certificate of $\mathop{\mathrm{width}}\nolimits(\mathcal{L}(\mathcal{A}))<4$.
Figure 3: DFA $\mathcal{A}''$ with $\mathop{\mathrm{width}}\nolimits(\mathcal{A}'') = 2$ and $\mathcal{L}(\mathcal{A}'')=\mathcal{L}(\mathcal{A})$, where $\mathcal{A}$ is as in Figure \ref{fig:dfawidth}.
Figure 4: Pruned power semi-DFA $\mathcal{B}$ of Definition \ref{def:semi-DFA B} constructed from the automaton $\mathcal{A}_{\min}$ of Figure \ref{fig:dfawidth} for $p=2$ (a) and $p=3$ (b). By Theorem \ref{thm: main: width - cycle} and Definition \ref{def:semi-DFA B}, we obtain that $\mathop{\mathrm{width}}\nolimits^D(\mathcal{L}(\mathcal{A}_{\min}))=2$, since (a) contains a cycle (i.e., $\mathop{\mathrm{width}}\nolimits^D(\mathcal{L}(\mathcal{A}_{\min}))\ge 2$) while (b) is acyclic (i.e., $\mathop{\mathrm{width}}\nolimits^D(\mathcal{L}(\mathcal{A}_{\min}))<3$).
Figure 5: The construction of the subgraphs in the set $C$ for $p=2$ for a 3-SAT formula $\Phi = (x_1 \vee \overline{x_2} \vee x_4) \,\wedge\,(\overline{x_1} \vee x_3 \vee \overline{x_4}) \,\wedge\, (x_2 \vee \overline{x_3} \vee \overline{x_4})\,\wedge\, (\overline{x_1} \vee x_2 \vee \overline{x_4})$ with $N=4$ variables and $M=4$ clauses. There are $p=2$ blocks: the first consists of variables $x_1, x_2$, and the second of variables $x_3, x_4$. Each block contains one subgraph for each of the $T = 2^{N/p} = 2^{4/2} = 4$ assignments. For better readability, we omit the transition labeled $\#$ from $v_{8}^{r, u}$ to $v_0^{r, u}$ for each block $r$ and assignment $u$. We observe that the assignment $x_1=1, x_2=0, x_3 = 1, x_4=0$ satisfies $\Phi$. Consequently, we find $p=2$ equally labeled cycles in the subgraphs $S_{1, 10}$ and $S_{2, 10}$. The string spelled by these cycles is $\alpha = 1233\,1010\#$. Character $h$ at position $j$ (for $j\in [M]=[4]$) in this string indicates that clause $C_j$ is made true by an assignment $u$ to the $h$'th literal of this clause, where $u$ is the assignment in whose subgraph we find the cycle. For example, the first $1$ in the string indicates that $C_1$ is made true by an assignment to its first literal; as we found the cycle labeled $\alpha$ in $S_{1, 10}$, we can conclude that the assignment to the variables of the first block that makes $C_1$ true is $10$. Indeed, the literal that makes $C_1$ true is $x_1$. For positions $\ell = M + i\in [M + 1, M + N]$, instead, the string $\alpha$ spells the assignment to variable $x_i$.
...and 2 more figures
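The arithmetic in the Figure 5 caption can be checked mechanically. The sketch below encodes the formula $\Phi$ and the assignment $x_1=1, x_2=0, x_3=1, x_4=0$ from the caption, then verifies that the first four characters of $\alpha$ (`1233`) each name a literal that is true in the corresponding clause. The encoding of literals as `(variable, polarity)` pairs and the helper names are assumptions made for this illustration; the code checks only the caption's example and does not build the subgraphs $S_{r,u}$ of the reduction.

```python
# Clauses of Phi from the Figure 5 caption.
# A literal (i, True) stands for x_i, and (i, False) for its negation.
PHI = [
    [(1, True), (2, False), (4, True)],   # C_1
    [(1, False), (3, True), (4, False)],  # C_2
    [(2, True), (3, False), (4, False)],  # C_3
    [(1, False), (2, True), (4, False)],  # C_4
]
ASSIGNMENT = {1: 1, 2: 0, 3: 1, 4: 0}     # x_1=1, x_2=0, x_3=1, x_4=0


def literal_true(literal, assignment):
    """Evaluate a single literal under the assignment."""
    var, positive = literal
    return bool(assignment[var]) == positive


def satisfied_literals(clause, assignment):
    """1-based positions of the literals of `clause` that are true."""
    return [h for h, lit in enumerate(clause, start=1)
            if literal_true(lit, assignment)]


def witness_is_valid(witness, formula, assignment):
    """Check that the j-th character of `witness` names a true literal of C_j."""
    return all(int(h) in satisfied_literals(clause, assignment)
               for h, clause in zip(witness, formula))


if __name__ == "__main__":
    N, p = 4, 2
    print(2 ** (N // p))                              # T = 4 assignments per block
    print(witness_is_valid("1233", PHI, ASSIGNMENT))  # True
    # Last N characters of alpha before '#': the assignment itself.
    print("".join(str(ASSIGNMENT[i]) for i in range(1, N + 1)))  # 1010
```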

Theorems & Definitions (39)

definition 1: Karp-Rabin hashing KR
definition 2: DFA
definition 3: semi-DFA
definition 4: Power semi-DFA
definition 5
definition 6: Infimum and supremum strings alanko_et_al:LIPIcs.CPM.2024.1
definition 7
definition 8: Co-lex Order CotumaccioJACM23
definition 9: Co-lex Width
lemma 1: Thm. 10 of KimOP23
...and 29 more
