A closer look at TDFA

Angelo Borsotti, Ulya Trafimovich

TL;DR

The paper introduces tagged deterministic finite automata (TDFA) for efficient submatch extraction from regular expressions (REs). TDFA combine tags with registers so that a tagged NFA (TNFA) can be determinized while preserving submatch information. The paper formalizes both TNFA and TDFA and presents practical variants for ahead-of-time (AOT) and just-in-time (JIT) determinization. It covers a set of optimizations (multi-valued and fixed tags, fallback handling, register optimization, and minimization) and a multi-pass TDFA approach for dense submatch scenarios. Detailed pseudocode, a step-by-step construction example, and benchmarks on real-world and synthetic REs indicate that TDFA, especially TDFA(1), are competitive with or superior to alternative approaches. Two independent implementations, the RE2C lexer generator and the experimental Java library RE2CJava, support the practicality of the approach. The emphasis throughout is on practical guidance for implementing and optimizing fast, correct submatch extraction in real-world tools.

Abstract

We present an algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata. The algorithm works with different disambiguation policies. We give detailed pseudocode for the algorithm, covering important practical optimizations. All transformations from a regular expression to an optimized automaton are explained on a step-by-step example. We consider both ahead-of-time and just-in-time determinization and describe variants of the algorithm suited to each setting. We provide benchmarks showing that the algorithm is very fast in practice. Our research is based on two independent implementations: an open-source lexer generator RE2C and an experimental Java library.
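
To make "submatch extraction" concrete, the sketch below shows what tag values look like for the tagged RE $(1a2)^*3(a|4b)5b^*$ that appears in the Figure 1 caption. It is not the TDFA algorithm: it uses Python's backtracking `re` module as a stand-in, with empty capturing groups emulating tags, so the result reflects Python's leftmost-greedy policy and could differ under another disambiguation policy. The names `TAGGED` and `tag_values` are hypothetical and exist only for this illustration.

```python
import re

# Tagged RE from the Figure 1 caption: (1 a 2)* 3 (a | 4 b) 5 b*
# Tags 1 and 2 are emulated by the start and end of the group around "a";
# tags 3, 4 and 5 are emulated by empty capturing groups.
# Assumption: this is only an illustration of tag values. Python's `re`
# is a backtracking matcher, not a TDFA.
TAGGED = re.compile(r"(?:(a))*()(?:a|()b)()b*")


def tag_values(s):
    """Return {tag: offset or None} for a full match of s, or None if no match."""
    m = TAGGED.fullmatch(s)
    if m is None:
        return None

    def offset(p):
        # `re` reports -1 for a group that did not take part in the match.
        return None if p == -1 else p

    return {
        1: offset(m.start(1)),
        2: offset(m.end(1)),
        3: offset(m.start(2)),
        4: offset(m.start(3)),
        5: offset(m.start(4)),
    }


print(tag_values("aab"))  # {1: 1, 2: 2, 3: 2, 4: 2, 5: 3}
print(tag_values("b"))    # {1: None, 2: None, 3: 0, 4: 0, 5: 1}
```

The string $aab$ is ambiguous for this RE (the second $a$ can belong either to the starred group or to the alternative), which is why the choice of disambiguation policy affects the reported tag values.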

Paper Structure (12 sections, 11 figures, 8 algorithms)

Figures (11)

  • Figure 1: Example for a RE $(1a2)^*3(a|4b)5b^*$: TNFA, simulation on string $aab$, determinization, TDFA.
  • Figure 2: Register optimizations for the TDFA in Figure 1. Top to bottom: initial CFG, CFG after compaction with per-block liveness information and interference table, CFG on the second round of optimizations, optimized TDFA with final registers $r_1$ to $r_5$.
  • Figure 3: Optimized TDFA with fixed tags $t_1 \leftarrow (\mathbf{n} \text{ if } t_2 = \mathbf{n} \text{ else } t_{2} - 1)$ and $t_3 \leftarrow (t_5 - 1)$. Tags $t_2$, $t_4$, $t_5$ correspond to final registers $r_1$, $r_2$, $r_3$.
  • Figure 4: Multi-pass TDFA for RE $(1a2)^*3(a|4b)5b^*$ matching string $aab$.
  • Figure 5: Benchmarks for AOT determinization, real-world REs.
  • ...and 6 more figures
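
The fixed-tag rule quoted in the Figure 3 caption can be read as a small post-processing step: only $t_2$, $t_4$ and $t_5$ are tracked in registers, and $t_1$ and $t_3$ are recomputed from them. The offsets are constant because the sub-expressions between the tags ($a$, and $a|b$) always consume exactly one symbol. The following is a minimal sketch under that reading; the function name `restore_fixed_tags` is hypothetical, and `None` is assumed to play the role of the caption's $\mathbf{n}$ (no value).

```python
NIL = None  # assumption: stands in for the "n" (no value) marker in the caption


def restore_fixed_tags(t2, t4, t5):
    """Recompute fixed tags t1 and t3 from the tracked tags t2, t4, t5.

    Follows the Figure 3 caption:
      t1 <- (n if t2 = n else t2 - 1)
      t3 <- (t5 - 1)
    """
    t1 = NIL if t2 is NIL else t2 - 1
    t3 = t5 - 1
    return {1: t1, 2: t2, 3: t3, 4: t4, 5: t5}


# Tracked tag values for one possible parse of "aab" (t2 = 2, t4 = 2, t5 = 3).
print(restore_fixed_tags(2, 2, 3))    # {1: 1, 2: 2, 3: 2, 4: 2, 5: 3}
# The starred group never matched, so t2 (and therefore t1) has no value.
print(restore_fixed_tags(NIL, 0, 1))  # {1: None, 2: None, 3: 0, 4: 0, 5: 1}
```

The benefit suggested by the caption is that the automaton needs three registers instead of five for this RE, at the cost of a constant amount of arithmetic after matching.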

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3