Table of Contents
Fetching ...

Formal Verification for JavaScript Regular Expressions: a Proven Semantics and its Applications (Extended Version)

Aurèle Barrière, Victor Deng, Clément Pit-Claudel

TL;DR

This work delivers a mechanized, concise, and complete formal semantics for JavaScript regular expressions with backtracking, implemented in Rocq and proven faithful to the ECMAScript standard via a Warblre embedding. It advances the field by (i) introducing a backtracking-tree semantics that captures all possible matches and their priorities, (ii) defining a robust, directional notion of regex equivalence for local rewrites, (iii) proving a new formal definition of contextual equivalence, and (iv) providing the first formal verification of the PikeVM matching algorithm. The combined theory and mechanization enable rigorous verification of real-world regex optimizations and engines, with concrete artifacts and connections to industry-adopted specifications. Overall, the paper lays a solid foundation for verified, realistic regex engines that support modern JavaScript features such as backreferences, lookarounds, and capture groups, while enabling principled optimization and transformation.

Abstract

We present the first mechanized, succinct, practical, complete, and proven-faithful semantics for a modern regular expression language with backtracking semantics. We ensure its faithfulness by proving it equivalent to a preexisting line-by-line embedding of the official ECMAScript specification of JavaScript regular expressions. We demonstrate its practicality by presenting two real-world applications. First, a new notion of contextual equivalence for modern regular expressions, which we use to prove or disprove rewrites drawn from previous work. Second, the first formal proof of the PikeVM algorithm used in many real-world engines. In contrast with the specification and other formalization work, our semantics captures not only the top-priority match, but a full backtracking tree recording all possible matches and their respective priority. All our definitions and results have been mechanized in the Rocq proof assistant.

Formal Verification for JavaScript Regular Expressions: a Proven Semantics and its Applications (Extended Version)

TL;DR

This work delivers a mechanized, concise, and complete formal semantics for JavaScript regular expressions with backtracking, implemented in Rocq and proven faithful to the ECMAScript standard via a Warblre embedding. It advances the field by (i) introducing a backtracking-tree semantics that captures all possible matches and their priorities, (ii) defining a robust, directional notion of regex equivalence for local rewrites, (iii) proving a new formal definition of contextual equivalence, and (iv) providing the first formal verification of the PikeVM matching algorithm. The combined theory and mechanization enable rigorous verification of real-world regex optimizations and engines, with concrete artifacts and connections to industry-adopted specifications. Overall, the paper lays a solid foundation for verified, realistic regex engines that support modern JavaScript features such as backreferences, lookarounds, and capture groups, while enabling principled optimization and transformation.

Abstract

We present the first mechanized, succinct, practical, complete, and proven-faithful semantics for a modern regular expression language with backtracking semantics. We ensure its faithfulness by proving it equivalent to a preexisting line-by-line embedding of the official ECMAScript specification of JavaScript regular expressions. We demonstrate its practicality by presenting two real-world applications. First, a new notion of contextual equivalence for modern regular expressions, which we use to prove or disprove rewrites drawn from previous work. Second, the first formal proof of the PikeVM algorithm used in many real-world engines. In contrast with the specification and other formalization work, our semantics captures not only the top-priority match, but a full backtracking tree recording all possible matches and their respective priority. All our definitions and results have been mechanized in the Rocq proof assistant.

Paper Structure

This paper contains 42 sections, 16 theorems, 12 equations, 19 figures.

Key Result

Theorem 1

$\forall\: \mathit{l},\: \mathit{i},\: \mathit{gm},\: \mathit{d},\: \mathit{t}_1,\: \mathit{t}_2.\;$$(\mathit{l},\mathit{i},\mathit{gm},\mathit{d})\mathbin{\Downarrow}\mathit{t}_1 \wedge (\mathit{l},\mathit{i},\mathit{gm},\mathit{d})\mathbin{\Downarrow}\mathit{t}_2 \implies \mathit{t}_1 = \mathit{t}

Figures (19)

  • Figure 1: Abstract JavaScript regex syntax
  • Figure 2: Backtracking tree $\mathit{t}$ of the regex ${ \mathrm{\langle\colorbox{ex2}{$\mathrm{\newline a}$}|\langle\colorbox{ex4}{$\mathrm{\newline a(_{1} b)}$}|\colorbox{ex3}{$\mathrm{\newline a}$}\rangle\rangle\colorbox{ex1}{$\mathrm{\newline bc}$}}}$ on input "abbc"
  • Figure 3: Backtracking Trees
  • Figure 4: Inductive tree semantics
  • Figure 5: Correct regex equivalences --- associativity, distributivity and anchors
  • ...and 14 more figures

Theorems & Definitions (16)

  • Theorem 1: Determinism
  • Theorem 2: Termination
  • Theorem 3: Functional Semantics Correctness
  • Theorem 4: Faithfulness
  • Theorem 5: Bidirectional Equivalence
  • Theorem 6: Forward Equivalence
  • Theorem 7: Backward Equivalence
  • Theorem 8: Contextual to Observational Equivalence
  • Theorem 9: Correctness of the Boolean semantics
  • Theorem 10
  • ...and 6 more