A parallel parser for regular expressions

Angelo Borsotti; Luca Breveglieri; Stefano Crespi Reghizzi; Angelo Morzenti

TL;DR

The paper tackles the problem of extracting the syntactic structure of text matched by a regular expression (RE). It introduces a parallel RE parser that outputs all possible syntax trees as a Shared Linearized Parse Forest (SLPF). The authors first build a serial parser that simulates a finite automaton to generate Linearized Syntax Trees (LSTs), then extend it with a parallel framework that splits the input into chunks and uses a novel multi-entry DFA (ME-DFA) to reduce speculative work during parallel parsing. Key contributions include the ME-DFA-based reach phase, a two-stage forward/backward parsing scheme that yields a clean SLPF, and memory and time optimizations. The reported speedups reach about 24x on commodity multi-core hardware for long texts, with memory use that scales linearly. The tool, implemented in Java, is benchmarked on real-life and synthetic REs and supports full parsing, recognition, and match extraction. Overall, the work offers a scalable approach to RE parsing, with potential applications in advanced text querying and structured pattern discovery on large datasets.

Abstract

Regular expression (RE) matching is a very common functionality that scans a text to find occurrences of patterns specified by an RE; it includes the simpler function of RE recognition. Here we address RE parsing, which subsumes matching by providing not just the pattern positions in the text, but also the syntactic structure of each pattern occurrence, in the form of a tree representing how the RE operators produced the patterns. RE parsing increases the selectivity of matching while avoiding the complications of context-free grammar parsers. Our parser manages ambiguous REs and texts by returning the set of all syntax trees, compressed into a Shared-Packed-Parse-Forest data structure. We initially convert the RE into a serial parser, which simulates a finite automaton (FA) so that the states the automaton passes through encode the syntax tree of the input. On long texts, serial matching and parsing may be too slow for time-constrained applications. Therefore, we present a novel efficient parallel parser for multi-processor computing platforms; its speed-up over the serial algorithm scales well with the text length. We apply to RE parsing the approach typical of parallel RE matchers and recognizers, where the text is split into chunks that are parsed in parallel and then joined together. Such an approach suffers from the so-called speculation overhead: a chunk processor does not know the state reached at the end of the preceding chunk, so it is forced to start speculatively in all its states. We introduce a novel technique that minimizes the speculation overhead. The multi-threaded parser program, written in Java, has been validated and its performance has been measured on a commodity multi-core computer, using public and synthetic RE benchmarks. The speed-up over serial parsing, parsing times, and parser construction times are reported.
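
The chunk-and-join scheme and its speculation overhead can be illustrated with a minimal Java sketch. It covers recognition only, not parsing, and it shows the baseline approach in which every chunk is run from all states; it is not the authors' parser and does not implement their technique for reducing speculation. The class name `ChunkedDfaSketch`, the toy DFA for `(ab)*`, and the methods `reach` and `recognize` are all assumptions made for illustration.

```java
import java.util.stream.IntStream;

// Hypothetical sketch: speculative chunked DFA recognition (not the paper's parser).
final class ChunkedDfaSketch {
    // Toy DFA over {a, b} for the RE (ab)*.
    // State 0 is initial and final, state 1 means "just read a", state 2 is a dead state.
    static final int[][] DELTA = {
        {1, 2}, // state 0: a -> 1, b -> 2
        {2, 0}, // state 1: a -> 2, b -> 0
        {2, 2}  // state 2: stays dead
    };
    static final int INITIAL = 0;
    static final boolean[] FINAL = {true, false, false};

    static int symbol(char c) {
        if (c == 'a') return 0;
        if (c == 'b') return 1;
        throw new IllegalArgumentException("unexpected character: " + c);
    }

    // Speculative run over text[from, to): the chunk processor does not know its
    // true start state, so it computes the end state for EVERY possible start state.
    // Running the chunk once per state is the speculation overhead.
    static int[] reach(String text, int from, int to) {
        int[] end = new int[DELTA.length];
        for (int start = 0; start < DELTA.length; start++) {
            int q = start;
            for (int i = from; i < to; i++) {
                q = DELTA[q][symbol(text.charAt(i))];
            }
            end[start] = q;
        }
        return end;
    }

    // Split the text into k chunks, process them in parallel, then join the
    // per-chunk state mappings serially, starting from the real initial state.
    static boolean recognize(String text, int chunks) {
        int len = text.length();
        int k = Math.max(1, chunks);
        int[][] maps = IntStream.range(0, k).parallel()
            .mapToObj(c -> reach(text,
                                 (int) ((long) c * len / k),
                                 (int) ((long) (c + 1) * len / k)))
            .toArray(int[][]::new);
        int q = INITIAL;
        for (int[] map : maps) {
            q = map[q]; // follow the one mapping entry that was actually needed
        }
        return FINAL[q];
    }

    public static void main(String[] args) {
        System.out.println(recognize("abababab", 4));
        System.out.println(recognize("abba", 2));
    }
}
```

In this sketch the join step uses only one entry of each chunk's mapping, so the work spent on the other start states is wasted. That wasted work is the cost the abstract refers to as speculation overhead.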
