Random Wheeler Automata

Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Riccardo Maso, Nicola Prezza

TL;DR

This work introduces a principled, uniform random-generation framework for Wheeler DFAs (WDFAs) by extending the Erdős-Rényi model to the WDFA class. It establishes a bijection between WDFAs and pairs $(O,I)$ encoding outgoing-label positions and in-degree blocks, enabling a linear-time, constant-space sampler (for suitable parameter ranges) that streams WDFAs directly to output. It also provides exact counting formulas for the number of WDFAs and tight encoding-length bounds, and demonstrates an efficient C++ implementation with a throughput of over 8 million transitions per second. Together, these results supply both theoretical insight and practical tooling for generating large, uniform WDFA datasets to support empirical testing of Wheeler-automata algorithms and theory.

Abstract

Wheeler automata were introduced in 2017 as a tool to generalize existing indexing and compression techniques based on the Burrows-Wheeler transform. Intuitively, an automaton is said to be Wheeler if there exists a total order on its states reflecting the co-lexicographic order of the strings labeling the automaton's paths; this property makes it possible to represent the automaton's topology in a constant number of bits per transition, as well as to efficiently solve pattern matching queries on its accepted regular language. Since their introduction, Wheeler automata have been the subject of a prolific line of research, both from the algorithmic and language-theoretic points of view. A recurring issue faced in these studies is the lack of large datasets of Wheeler automata on which the developed algorithms and theories could be tested. One possible way to overcome this issue is to generate random Wheeler automata. Motivated by this observation, in this paper we initiate the theoretical study of random Wheeler automata, focusing on the deterministic case (Wheeler DFAs, or WDFAs). We start by extending the Erdős-Rényi random graph model to WDFAs, and proceed by providing an algorithm generating uniform WDFAs according to this model. Our algorithm generates a uniform WDFA with $n$ states, $m$ transitions, and alphabet cardinality $\sigma$ in $O(m)$ expected time ($O(m\log m)$ worst-case time w.h.p.) and constant working space for all alphabets of size $\sigma \le m/\ln m$. As a by-product, we also give formulas for the number of distinct WDFAs and obtain that $n\sigma + (n - \sigma) \log \sigma$ bits are necessary and sufficient to encode a WDFA with $n$ states and alphabet of size $\sigma$, up to an additive $\Theta(n)$ term. We present an implementation of our algorithm and show that it is extremely fast in practice, with a throughput of over 8 million transitions per second.
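To make the two quantitative statements in the abstract concrete, the following is a minimal sketch that evaluates the leading term of the encoding bound and checks the alphabet-size condition of the generation algorithm. The function names are hypothetical, the logarithm in the bound is assumed to be base 2 (since the bound is stated in bits), and the additive $\Theta(n)$ term is omitted.

```python
import math


def wdfa_encoding_bits(n: int, sigma: int) -> float:
    """Leading term n*sigma + (n - sigma)*log2(sigma) of the encoding bound.

    Assumptions: log is base 2; the additive Theta(n) term is ignored.
    """
    if n < 1 or sigma < 1:
        raise ValueError("n and sigma must be positive")
    return n * sigma + (n - sigma) * math.log2(sigma)


def alphabet_in_range(m: int, sigma: int) -> bool:
    """Check the condition sigma <= m / ln(m) under which the time bound is stated."""
    if m < 2:
        # ln(1) = 0, so the ratio is undefined for m = 1.
        return False
    return sigma <= m / math.log(m)


# Parameters of the running example in Figure 1: n = 5, m = 6, sigma = 2.
print(wdfa_encoding_bits(5, 2))   # 5*2 + 3*1 = 13.0
print(alphabet_in_range(6, 2))    # 6 / ln 6 is about 3.35, so sigma = 2 qualifies
```

For the running example's parameters the leading term is 13 bits, which illustrates that the $n\sigma$ part dominates when the alphabet is small relative to $n$.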

Paper Structure (18 sections, 12 theorems, 4 equations, 3 figures, 5 algorithms)

Key Result

Theorem 1

There is an algorithm to generate a uniform WDFA from $\mathcal{D}_{n, m, \sigma}$ in $O(m)$ expected time ($O(m\log m)$ time with high probability) using $O(1)$ words of working space, for all alphabets of size $\sigma \le m/\ln m$. The output WDFA is directly streamed to the output as a set of lab

Figures (3)

  • Figure 1: Running example: a WDFA $D$ with $n=5$ states, $m=6$ edges, alphabet cardinality $\sigma=2$, and Wheeler order $1<2<3<4<5$. Note that the WDFA has two connected components.
  • Figure 2: Matrix $O$ (left) and bit-vector $I$ (right) forming the encoding $r(D)=(O,I)$ of the WDFA $D$ of Figure 1. In matrix $O$, column names are characters from $\Sigma=[\sigma]$ and row names are states from $Q=[n]$. In bit-vector $I$, each state (except state 1) is associated with a set bit, in Wheeler order. Cells containing a set bit are named with the name of the corresponding state. Bits in bold highlight the states at which the character labeling the state's incoming transitions changes (i.e., state 2 is the first whose incoming transitions are labeled 1, and state 3 is the first whose incoming transitions are labeled 2).
  • Figure 3: Wall clock time for generating random WDFAs using the constant-space algorithm. Left: running time for the algorithm in case (1), i.e., streaming the resulting WDFAs to disk. Right: running time in case (2), i.e., storing WDFAs in internal memory.
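The captions above refer to a Wheeler order on the states. As an illustration, the sketch below checks whether numbering the states $1,\dots,n$ gives a Wheeler order, following the usual Wheeler-graph axioms (states without incoming transitions come first; transitions with smaller labels reach smaller states; among equally labeled transitions, a smaller source never reaches a larger target out of order). Definition 3 is not reproduced in this excerpt, so its exact wording may differ. The function name and the example automaton are hypothetical; the example shares the parameters $n=5$, $m=6$, $\sigma=2$ with Figure 1 but is not the automaton shown there.

```python
def is_wheeler_order(n, edges):
    """Return True if the order 1 < 2 < ... < n is a Wheeler order for the DFA.

    edges: iterable of (source, target, label) triples with states in 1..n
    and integer labels. Uses a direct O(m^2) pairwise check for clarity.
    """
    edges = list(edges)

    # Determinism: at most one outgoing transition per (state, label) pair.
    if len({(u, a) for u, _, a in edges}) != len(edges):
        return False

    # Axiom (i): states with no incoming transition precede all other states.
    has_incoming = {v for _, v, _ in edges}
    sources = [q for q in range(1, n + 1) if q not in has_incoming]
    if sources and has_incoming and max(sources) > min(has_incoming):
        return False

    for u1, v1, a1 in edges:
        for u2, v2, a2 in edges:
            # Axiom (ii): a smaller label must lead to a strictly smaller target.
            if a1 < a2 and not v1 < v2:
                return False
            # Axiom (iii): same label and smaller source imply target not larger.
            if a1 == a2 and u1 < u2 and not v1 <= v2:
                return False
    return True


# Hypothetical automaton (NOT the one in Figure 1): 5 states, 6 transitions, labels {1, 2}.
HYPOTHETICAL_EDGES = [(1, 2, 1), (1, 3, 2), (2, 3, 2), (3, 4, 2), (4, 5, 2), (5, 5, 2)]
print(is_wheeler_order(5, HYPOTHETICAL_EDGES))
```

The pairwise loop mirrors the axioms one-to-one; sorting the transitions by (label, source) and comparing neighbors would give the same answer in $O(m \log m)$ time.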

Theorems & Definitions (20)

  • Theorem 1
  • Definition 2: Deterministic Finite Automaton (DFA)
  • Definition 3: Wheeler DFA [gagie:tcs17:wheeler]
  • Definition 4
  • Definition 6
  • Remark 7
  • Remark 8
  • Definition 9
  • Definition 10
  • Lemma 11
  • ...and 10 more