Correct and Optimal: the Regular Expression Inference Challenge

Mojtaba Valizadeh; Philip John Gorinski; Ignacio Iacobacci; Martin Berger

TL;DR

This work introduces Regular Expression Inference (REI) as a supervised ML challenge: learn a minimal regular expression from positive and negative examples under a given cost function. It formalises REI as a challenge (REIC), provides four binary-alphabet datasets with varying operator sets and cost configurations, and uses a GPU-based solver to generate ground-truth minimal REs. The authors establish an evaluation harness with metrics that balance correctness and optimisation, and present several baselines (Trivial, PN/RE Retrieval, StarChatβ, and ReGPT) to illustrate the difficulty and landscape of the problem. The results indicate that REI is hard for current learning approaches, particularly with respect to minimality, and suggest promising directions that combine ML with algorithmic search for code-like synthesis tasks. The authors invite the community to participate in the REIC benchmarks to advance understanding of optimisation in ML-based code synthesis.

Abstract

We propose regular expression inference (REI) as a challenge for code/language modelling and the wider machine learning community. REI is a supervised machine learning (ML) and program optimisation task that poses the problem of finding minimal regular expressions from examples: given two finite sets of strings $P$ and $N$ and a cost function $cost(\cdot)$, the task is to generate an expression $r$ that accepts all strings in $P$ and rejects all strings in $N$, such that no other such expression $r'$ exists with $cost(r')<cost(r)$. REI has advantages as a challenge problem: (i) regular expressions are well known, widely used, and a natural idealisation of code; (ii) REI's asymptotic worst-case complexity is well understood; (iii) REI has a small number of easy-to-understand parameters (e.g. the cardinality of $P$ or $N$, the string lengths of examples, or the cost function), which lets us easily tune the hardness of REI instances; (iv) REI, with its emphasis on optimisation, is an unsolved problem for deep-learning-based ML. Recently, an REI solver was implemented on GPUs, using program synthesis techniques. This enabled, for the first time, fast generation of minimal regular expressions for complex REI instances. Building on this advance, we generate and publish the first large-scale datasets for REI, and devise and evaluate several initial heuristic and machine learning baselines. We invite the community to participate and explore ML methods that learn to solve REI problems. We believe that progress in REI directly translates to progress in code/language modelling.
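
To make the problem statement concrete, the following is a minimal Python sketch of an REI instance, not the authors' solver or harness. It rests on several assumptions that do not come from the paper: Python's `re` module stands in for formal regular expressions, the operator set is restricted to union, Kleene star and grouping over the binary alphabet, and the hypothetical `cost` function charges one unit per character. The helper names (`is_correct`, `cost`, `union_of_positives`, `brute_force_minimal`) are illustrative only.

```python
import re
from itertools import product

# Assumption: binary alphabet plus union, Kleene star and grouping only.
SYMBOLS = "01|*()"


def is_correct(regex, P, N):
    """True if regex accepts every string in P and rejects every string in N."""
    try:
        compiled = re.compile(regex)
    except re.error:
        return False  # not a well-formed expression
    accepts_all_p = all(compiled.fullmatch(s) for s in P)
    rejects_all_n = not any(compiled.fullmatch(s) for s in N)
    return accepts_all_p and rejects_all_n


def cost(regex):
    """Hypothetical uniform cost: every character counts as one unit."""
    return len(regex)


def union_of_positives(P):
    """A correct but usually non-minimal answer when P and N are disjoint."""
    return "|".join(sorted(P))


def brute_force_minimal(P, N, max_len=6):
    """Enumerate candidates by increasing length and return the first correct one.

    Under the uniform cost above, the first hit has minimal cost.
    Returns None if no correct candidate of length <= max_len exists.
    """
    for length in range(1, max_len + 1):
        for chars in product(SYMBOLS, repeat=length):
            candidate = "".join(chars)
            if is_correct(candidate, P, N):
                return candidate
    return None


P = {"0", "00", "000"}   # strings the expression must accept
N = {"", "1", "01"}      # strings the expression must reject

trivial = union_of_positives(P)
best = brute_force_minimal(P, N, max_len=4)
print(trivial, cost(trivial))
print(best, cost(best))
```

In this toy instance the union of positives is correct but costs more than the expression found by exhaustive search, which illustrates the gap between correctness and optimality that the challenge targets. The search space grows exponentially with expression length, so this enumeration is only practical for very small instances.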

Paper Structure (11 sections, 6 equations, 2 tables)

Introduction
Background & Related Work
The Regular Expression Inference Challenge
Datasets
Data Generation
REIC Metrics and Baselines
Challenge scoring.
Baselines.
ReGPT Training and Inference.
Discussion of Baseline Performance
Conclusions
