A parallel parser for regular expressions

Angelo Borsotti, Luca Breveglieri, Stefano Crespi Reghizzi, Angelo Morzenti

TL;DR

The paper tackles the problem of extracting the syntactic structure of texts matched by regular expressions. It introduces a parallel RE parser that outputs all possible syntax trees as a Shared Linearized Parse Forest (SLPF). The authors first build a serial parser that simulates a finite automaton to generate Linearized Syntax Trees (LSTs), then extend it with a parallel framework that splits the input into chunks and uses a novel multi-entry DFA (ME-DFA) to reduce speculative work. Key contributions include the ME-DFA-based reach phase, two-stage forward/backward parsing for clean SLPF construction, and memory and time optimizations. Together these yield speedups of up to about 24x on commodity multi-core hardware for long texts while keeping memory use linear. The tool, implemented in Java, is benchmarked on real-life and synthetic REs and supports full parsing, recognition, and match extraction. The approach is relevant to advanced text querying and structured pattern discovery on large datasets.

Abstract

Regular expression (RE) matching is a very common functionality that scans a text to find occurrences of patterns specified by an RE; it includes the simpler function of RE recognition. Here we address RE parsing, which subsumes matching by providing not just the pattern positions in the text, but also the syntactic structure of each pattern occurrence, in the form of a tree representing how the RE operators produced the patterns. RE parsing increases the selectivity of matching while avoiding the complications of context-free grammar parsers. Our parser handles ambiguous REs and texts by returning the set of all syntax trees, compressed into a Shared-Packed-Parse-Forest data structure. We first convert the RE into a serial parser, which simulates a finite automaton (FA) so that the states the automaton passes through encode the syntax tree of the input. On long texts, serial matching and parsing may be too slow for time-constrained applications. Therefore, we present a novel efficient parallel parser for multi-processor computing platforms; its speed-up over the serial algorithm scales well with the text length. We apply to RE parsing the approach typical of parallel RE matchers and recognizers, where the text is split into chunks that are parsed in parallel and then joined together. This approach suffers from the so-called speculation overhead: a chunk processor does not know the state reached at the end of the preceding chunk, so it must speculatively start in all its states. We introduce a novel technique that minimizes the speculation overhead. The multi-threaded parser program, written in Java, has been validated and its performance has been measured on a commodity multi-core computer, using public and synthetic RE benchmarks. The speed-up over serial parsing, parsing times, and parser construction times are reported.
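The chunk-and-join scheme and its speculation overhead can be illustrated with a minimal Java sketch. It is not the authors' parser: it performs recognition only (no syntax trees, no ME-DFA), and the class name `SpeculativeChunks`, the three-state DFA for `(ab)*`, and the two-letter alphabet are all assumptions made for illustration. Each chunk is run from every state, which is exactly the redundant work that the paper's technique aims to reduce.

```java
import java.util.stream.IntStream;

public class SpeculativeChunks {
    // Hypothetical DFA for (ab)* over the alphabet {a, b}:
    // state 0 = start and accepting, state 1 = just read 'a', state 2 = dead state.
    static final int[][] DELTA = {
        {1, 2},  // from state 0: 'a' -> 1, 'b' -> 2
        {2, 0},  // from state 1: 'a' -> 2, 'b' -> 0
        {2, 2}   // from state 2: stays dead
    };
    static final int START = 0;
    static final boolean[] ACCEPTING = {true, false, false};

    // Speculative run of one chunk: since the entry state is unknown,
    // compute the exit state for EVERY possible entry state.
    static int[] runChunk(String text, int from, int to) {
        int[] exit = new int[DELTA.length];
        for (int s = 0; s < DELTA.length; s++) {
            int q = s;
            for (int i = from; i < to; i++) {
                q = DELTA[q][text.charAt(i) - 'a'];  // assumes only 'a' and 'b' occur
            }
            exit[s] = q;
        }
        return exit;
    }

    // Split the text into `chunks` pieces (chunks >= 1), process them in parallel,
    // then join serially by following the entry-to-exit maps from the start state.
    static boolean accepts(String text, int chunks) {
        int len = text.length();
        int[][] maps = IntStream.range(0, chunks).parallel()
            .mapToObj(c -> runChunk(text,
                                    (int) ((long) c * len / chunks),
                                    (int) ((long) (c + 1) * len / chunks)))
            .toArray(int[][]::new);
        int q = START;
        for (int[] m : maps) {
            q = m[q];  // only one speculative run per chunk turns out to be useful
        }
        return ACCEPTING[q];
    }

    public static void main(String[] args) {
        System.out.println(accepts("abababab", 3));
        System.out.println(accepts("aba", 2));
    }
}
```

In this sketch the per-chunk cost grows with the number of DFA states, because every state is tried as an entry point; the join step then discards all but one result per chunk.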

Paper Structure

This paper contains 77 sections, 2 theorems, 14 equations, 24 figures, 7 tables.

Key Result

proposition 1

The (regular) language generated by the numbered RE $e_\#$ is the set of the linearized syntax trees (LSTs) of the original RE $e$. ∎

Figures (24)

  • Figure 1: The RE $e_1$ with its functional (structure) tree and the two syntax trees of the valid string $a\,b\,a$ that are produced by a parser. The two syntax trees are represented as graphs and linearly as parenthesized strings.
  • Figure 2: Structure tree of RE $e_1$, and syntax trees of string $a\,b\,a$ in graphic and linearized form (the root subscript is omitted).
  • Figure 3: The parser NFA for RE $e_2$ of Ex. $2$. The NFA encodes in its states the language of the LSTs of RE $e_2$ (to be detailed in Tab. \ref{tab:strings}). The upper right box reproduces the classic GMY NFA that recognizes (but does not parse) the language $L \, (e_2)$; its states correspond to subsets of the parser NFA states.
  • Figure 4: Non-ambiguous RE $e_2$ with its table of operators and their numbering, structure tree, and list of all classic followers. The end-mark is omitted.
  • Figure 5: A recursive algorithm for computing all the segments for an RE, with the initial and final ones distinguished.
  • ...and 19 more figures

Theorems & Definitions (2)

  • proposition 1: language of LST
  • proposition 2: segment finiteness