A closer look at TDFA
Angelo Borsotti, Ulya Trafimovich
TL;DR
The paper presents tagged deterministic finite automata (TDFA) for efficient submatch extraction from regular expressions. TDFA combine tags with registers so that an NFA can be determinized without losing submatch information. The paper formalizes both TNFA and TDFA and describes practical variants for ahead-of-time (AOT) and just-in-time (JIT) determinization. It covers a set of optimizations (multi-valued tags, fixed tags, fallback handling, register optimization, and minimization) and a multi-pass TDFA approach for regular expressions with dense submatch information. Detailed pseudocode, a step-by-step construction example, and benchmarks on real and synthetic regular expressions indicate that TDFA, especially TDFA(1), is competitive with or superior to alternative approaches. The RE2C and RE2CJava implementations show that the approach is practical. Throughout, the emphasis is on implementation and optimization guidance for fast, correct submatch parsing in real-world tools.
Abstract
We present an algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata. The algorithm works with different disambiguation policies. We give detailed pseudocode for the algorithm, covering important practical optimizations. All transformations from a regular expression to an optimized automaton are illustrated with a step-by-step example. We consider both ahead-of-time and just-in-time determinization and describe variants of the algorithm suited to each setting. We provide benchmarks showing that the algorithm is very fast in practice. Our research is based on two independent implementations: the open-source lexer generator RE2C and an experimental Java library.
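To give a rough feel for how tags and registers interact, the following hand-written Python sketch hardcodes a two-state automaton for the regular expression a* t b*, where t is a tag marking the boundary between the a's and the b's. It is only an illustration under stated assumptions: it is not the paper's construction algorithm and not RE2C output, and the names `match_with_tag`, `state`, and `r` are hypothetical.

```python
from typing import Optional


def match_with_tag(s: str) -> Optional[int]:
    """Match s against a* t b* and return the offset of tag t, or None on failure.

    Hand-built illustration (an assumption of this sketch, not generated code):
      state 0: still reading a's
      state 1: reading b's
    The register r is written by an operation attached to the 0 -> 1 transition.
    """
    state = 0
    r = None  # register holding a candidate value for tag t

    for pos, ch in enumerate(s):
        if state == 0:
            if ch == "a":
                continue          # stay in state 0, no register operation
            if ch == "b":
                r = pos           # register operation on the transition: t is before this 'b'
                state = 1
                continue
            return None           # no transition on this symbol
        else:  # state == 1
            if ch == "b":
                continue          # stay in state 1, no register operation
            return None

    # Final operations, applied when the input ends in an accepting state.
    if state == 0:
        return len(s)             # no 'b' was seen, so t sits at the end of the input
    return r
```

For instance, `match_with_tag("aabbb")` returns 2, the offset at which the a's end and the b's begin, while `match_with_tag("aba")` returns `None`. The point of the sketch is that the tag value is recovered from a register updated along deterministic transitions, with no backtracking over the input.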
