ReSyn: A Generalized Recursive Regular Expression Synthesis Framework

Seongmin Kim, Hyunjoon Cheon, Su-Hyeon Kim, Yo-Sub Han, Sang-Ki Ko

Abstract

Existing Programming-By-Example (PBE) systems often rely on simplified benchmarks that fail to capture the high structural complexity of real-world regexes, such as deeper nesting and frequent Unions. To overcome the resulting performance drop, we propose ReSyn, a synthesizer-agnostic divide-and-conquer framework that decomposes complex synthesis problems into manageable sub-problems. We also introduce Set2Regex, a parameter-efficient synthesizer that captures the permutation invariance of examples. Experimental results demonstrate that ReSyn significantly boosts accuracy across various synthesizers, and its combination with Set2Regex establishes a new state of the art on a challenging real-world benchmark.

Paper Structure

This paper contains 57 sections, 8 theorems, 2 equations, 6 figures, 9 tables, 1 algorithm.

Key Result

Lemma 1

For any finite language $S$, the language expression cost $c_{E}(S)$ is equal to the optimal alignment cost $c(S)$.

Figures (6)

  • Figure 1: The architecture of Set2Regex. The Hierarchical Set Encoder aggregates character features into string embeddings ($h_i$) and subsequently into a global context ($c$) via PMA to ensure permutation invariance. The decoder employs a dual-attention mechanism, attending first to $c$ for global structure and then to $\{h'_i\}$ for local details.
  • Figure 2: Structural Complexity Comparison across Benchmarks. The top row displays the distribution of Top-level Operators; while Structured-Regex and Snort are dominated by Concatenations, RegExLib exhibits a diverse structural composition with significant use of Union operators. The bottom row illustrates the AST Depth distribution using a sequential color gradient (darker indicates deeper nesting). Notably, RegExLib contains a significant proportion of high-complexity instances (Depth $\ge$ 5), whereas the other benchmarks are concentrated in shallower regions.
  • Figure 3: Performance comparison across regex AST depth. Non-recursive methods (Forest, Split-Regex) show sharp degradation with increasing depth, while Recursive maintains robust performance. The gap widens beyond depth 4, highlighting the necessity of recursive decomposition for complex real-world regexes.
  • Figure 4: Example alignment of two strings 'http' and 'ftps'
  • Figure 5: A running example of the ReSyn framework process. The input positive set $P$ is first decomposed via Segmentation into three logical components ($P^{(1)}$: User, $P^{(2)}$: Domain, $P^{(3)}$: TLD), mirroring a Concatenation structure. Subsequently, the Router recursively determines the strategy for each subset. For instance, $P^{(1)}$ and $P^{(3)}$ are further decomposed via Partitioning (Union), while non-decomposable components are synthesized directly. Finally, the partial regexes are reconstructed bottom-up to form the complete regex. Note that this example is simplified for illustrative purposes.
  • ...and 1 more figures
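
The recursive process described in the Figure 5 caption (Segmentation for Concatenation, Partitioning for Union, direct synthesis otherwise, then bottom-up reconstruction) can be illustrated with a toy Python sketch. Everything here is a hypothetical stand-in rather than the paper's method: the `DELIMITERS` heuristic, the `shape`-based grouping, and `synthesize_direct` are assumptions that replace the learned Router and the base synthesizer, and the function name `resyn` is chosen only for illustration.

```python
import re

# Assumption: candidate segmentation points; the paper's Router is not rule-based like this.
DELIMITERS = "@."


def shape(s):
    """Coarse character-class label used as a toy partitioning criterion."""
    if re.fullmatch(r"[0-9]+", s):
        return "digit"
    if re.fullmatch(r"[a-z]+", s):
        return "lower"
    return "other"


def synthesize_direct(strings):
    """Stand-in base synthesizer: a character class if one fits, else a literal union."""
    if all(shape(s) == "digit" for s in strings):
        return "[0-9]+"
    if all(shape(s) == "lower" for s in strings):
        return "[a-z]+"
    return "(" + "|".join(re.escape(s) for s in strings) + ")"


def resyn(strings):
    """Recursively decompose a positive set and rebuild a regex bottom-up."""
    strings = sorted(set(strings))

    # Segmentation (Concatenation): split every string at a shared delimiter.
    for d in DELIMITERS:
        if all(d in s for s in strings):
            left = [s.split(d, 1)[0] for s in strings]
            right = [s.split(d, 1)[1] for s in strings]
            return resyn(left) + re.escape(d) + resyn(right)

    # Partitioning (Union): group strings by shape and synthesize each group.
    groups = {}
    for s in strings:
        groups.setdefault(shape(s), []).append(s)
    if len(groups) > 1:
        return "(" + "|".join(resyn(g) for g in groups.values()) + ")"

    # Non-decomposable component: hand it to the base synthesizer.
    return synthesize_direct(strings)


if __name__ == "__main__":
    positives = ["alice@example.com", "bob42@mail.org", "carol@site.net"]
    regex = resyn(positives)
    print(regex)
    print(all(re.fullmatch(regex, s) for s in positives))
```

On this input the sketch segments each address at `@` and then at `.`, partitions the user component into a lowercase group and a literal group, and concatenates the partial regexes. Because the base synthesizer is a plain function argument in spirit, any other synthesizer could be substituted for `synthesize_direct`, which is the sense in which the framework is described as synthesizer-agnostic.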

Theorems & Definitions (20)

  • Lemma 1
  • Lemma 2
  • Proof (Sketch)
  • Theorem 1
  • Definition 1: Expression cost
  • Example 1
  • Definition 2: Language expression cost
  • Definition 3: String decomposition cost
  • Example 2
  • Definition 4: Alignment cost
  • ...and 10 more