RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems

Hyundong Jin, Joonghyuk Hahn, Yo-Sub Han

TL;DR

RegexPSPACE introduces the first benchmark for evaluating LLMs and LRMs on PSPACE-complete regex problems, focusing on RegexEq and RegexMin. It constructs large-scale labeled and unlabeled regex datasets via massive space exploration and sound filtering, enabling quantitative evaluation with metrics for minimality, equivalence, and length as well as accuracy and F1 for decision tasks. The study reveals a substantial gap between theoretical computational power and practical performance, with minimization being markedly harder than equivalence and prone to repetition and verbosity under limited context. The framework provides a rigorous, extensible platform and release-ready code to drive future progress in reasoning under spatial constraints.

Abstract

Large language models (LLMs) show strong performance across natural language processing (NLP), mathematical reasoning, and programming, and recent large reasoning models (LRMs) further emphasize explicit reasoning. Yet their computational limits, particularly spatial complexity constrained by finite context windows, remain poorly understood. While recent works often focus on problems within the NP complexity class, we push the boundary by introducing a novel benchmark grounded in two PSPACE-complete regular expression (regex) problems: equivalence decision (RegexEq) and minimization (RegexMin). PSPACE-complete problems serve as a more rigorous standard for assessing computational capacity, as their solutions require massive search space exploration. We perform a double-exponential space exploration to construct a labeled dataset of over a million regex instances with a sound filtering process to build the benchmark. We conduct extensive evaluations on 6 LLMs and 5 LRMs of varying scales, revealing common failure patterns such as verbosity and repetition. With its well-defined structure and quantitative evaluation metrics, this work presents the first empirical investigation into the spatial computational limitations of LLMs and LRMs, offering a new framework for evaluating their advanced reasoning capabilities. Our code is available at https://github.com/hyundong98/RegexPSPACE.
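To make the equivalence task concrete, here is a minimal sketch of a bounded comparison between two regexes. It is not the benchmark's procedure: the function names `bounded_language` and `agree_up_to`, the alphabet `"ab"`, the length bound, and the use of Python `re` syntax are all assumptions made for illustration. A bounded check can refute equivalence by finding a distinguishing string, but agreement up to a fixed length does not prove equivalence.

```python
import re
from itertools import product


def bounded_language(pattern: str, alphabet: str, max_len: int) -> frozenset:
    """Return every string over `alphabet` of length <= max_len that
    `pattern` matches in full (hypothetical helper, Python `re` syntax)."""
    compiled = re.compile(pattern)
    accepted = set()
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            word = "".join(chars)
            if compiled.fullmatch(word):
                accepted.add(word)
    return frozenset(accepted)


def agree_up_to(p: str, q: str, alphabet: str = "ab", max_len: int = 5) -> bool:
    """True if p and q accept the same strings up to max_len.
    False is a definite 'not equivalent'; True is only evidence."""
    return bounded_language(p, alphabet, max_len) == bounded_language(q, alphabet, max_len)


if __name__ == "__main__":
    print(agree_up_to("(ab)*a", "a(ba)*"))   # same language, written two ways
    print(agree_up_to("a*", "a*a"))          # differ on the empty string
```

The number of strings examined grows exponentially with `max_len`, which gives a small-scale feel for why exact equivalence and minimization demand so much search.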

Paper Structure

This paper contains 80 sections, 6 theorems, 12 equations, 14 figures, 12 tables, 6 algorithms.

Key Result

Lemma D.1

Given an integer $n$ and an alphabet $\Sigma$, the set $A_n$ constructed by Algorithm \ref{regmin:app:alg:buildAn} contains all possible regular expressions over $\Sigma$ up to equivalence, that is, when equivalent expressions are treated as the same.
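The idea of collecting regexes "up to equivalence" can be sketched as a bottom-up enumeration over tree depth that keeps one smallest representative per class. This is an illustrative approximation and not the authors' algorithm: `signature`, `build_classes`, the node-count size measure, and the bounded-length signature are all assumptions. In particular, the signature only compares strings up to `max_len`, so it may merge regexes that are not truly equivalent, whereas the lemma concerns exact equivalence.

```python
import re
from itertools import product


def signature(pattern: str, alphabet: str, max_len: int) -> frozenset:
    """Strings of length <= max_len fully matched by `pattern`
    (a bounded stand-in for the regex's language; hypothetical helper)."""
    compiled = re.compile(pattern)
    return frozenset(
        "".join(chars)
        for length in range(max_len + 1)
        for chars in product(alphabet, repeat=length)
        if compiled.fullmatch("".join(chars))
    )


def build_classes(alphabet: str = "ab", depth: int = 2, max_len: int = 4) -> dict:
    """Map each bounded signature to (tree_size, pattern) for the smallest
    pattern found, growing expressions one depth level per round."""
    classes = {}

    def offer(size: int, pattern: str) -> None:
        sig = signature(pattern, alphabet, max_len)
        best = classes.get(sig)
        if best is None or size < best[0]:
            classes[sig] = (size, pattern)

    for symbol in alphabet:                    # depth 0: single symbols
        offer(1, symbol)
    for _ in range(depth):
        current = list(classes.values())       # snapshot of representatives
        for size, r in current:
            offer(size + 1, f"({r})*")         # Kleene star
        for (sr, r), (ss, s) in product(current, repeat=2):
            offer(sr + ss + 1, f"({r}|{s})")   # union
            offer(sr + ss + 1, f"({r})({s})")  # concatenation
    return classes


if __name__ == "__main__":
    found = build_classes("ab", depth=2, max_len=4)
    print(len(found), "classes; representative for (a|b)*:",
          found[signature("(a|b)*", "ab", 4)])
```

Keeping only one representative per class is what keeps the enumeration manageable here; the number of candidates still grows very quickly with depth.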

Figures (14)

  • Figure 1: Overview diagram for the complexity class. In this work, we target PSPACE-complete problems, a class that has received relatively little exploration so far. The papers cited in the figure are [FanHLLZ24], [FanHLZJLLCWMZ24], [SubramanianKSMP25], and [TangZLCL25].
  • Figure 2: Overview of our work. We construct the labeled regex dataset (LRD) and the unlabeled regex minimization test set (URMT) and label LRD with the massive partitioning of regexes. The stars on the 3D graph visualize the number of regexes in the dataset and the number of regexes to examine for calculating minimality. We construct RegexPSPACE by filtering the test set of LRD and evaluate LLMs and LRMs on our benchmark.
  • Figure 3: Case analysis bar chart of the zero-shot prompting results on RegexMin. The outcomes are categorized into Minimality, Not minimal but equivalent, Not equivalent but valid, Invalid but completed answers, Repetition, and Incomplete outputs.
  • Figure 4: Case analysis bar chart of the zero-shot prompting results on RegexEq. The outcomes are categorized into the components of the confusion matrix, together with Invalid but completed answers, Repetition, and Incomplete outputs.
  • Figure 5: The overview of our dataset construction and evaluation. Our dataset consists of a regex minimization corpus (LRD) and an extended regex minimization benchmark (URMT), both constructed using a bottom-up approach over tree depth. LRD is labeled using the minimal tree length calculated in Section \ref{regmin:app:ssec:LRD_construction}.
  • ...and 9 more figures

Theorems & Definitions (12)

  • Lemma D.1
  • proof
  • Lemma D.2
  • proof
  • Lemma D.3
  • proof
  • Corollary D.4
  • proof
  • Corollary D.5
  • proof
  • ...and 2 more