PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics
Atharva Naik, Prakam, Darsh Agrawal, Yash Mathur, Manav Kapadnis, Yuwei An, Clayton Marr, Carolyn Rose, David Mortensen
TL;DR
PBEBench introduces a scalable, domain-agnostic benchmark for inductive reasoning by casting forward reconstruction as multi-step string-rewrite cascades, with two datasets, PBEBench-Lite and PBEBench, generated by a fully automated problem proposer. It formalizes instances as $\langle \vec{i}, \vec{p}, \vec{o} \rangle$, controlled by cascade length $L$ and relation-type vector $c_p \in \{0,1\}^4$, and evaluates LLMs using Pass@1 and Edit_Sim, accounting for test-time reasoning and constraints. The study shows that reasoning-enabled models outperform non-reasoning ones, but performance degrades sharply with cascade length; even the strongest open- and closed-source models fall short on realistic forward-reconstruction-like tasks, though scaling strategies provide partial gains. The methodology enables contamination-free data generation and controllable difficulty, offering a practical path toward curriculum-style data for improving inductive reasoning in future models. The results highlight a meaningful gap between current capabilities and real-world forward reconstruction demands, motivating further exploration of scalable generation, structured reasoning, and more robust prompt-and-search strategies.
Abstract
Although many benchmarks evaluate the reasoning abilities of Large Language Models (LLMs) within domains such as mathematics, coding, or data wrangling, few abstract away from domain specifics to examine reasoning as a capability in and of itself. We contribute a novel type of benchmark evaluating the inductive reasoning capabilities of LLMs that is inspired by the forward reconstruction task from historical linguistics but is formulated in an extremely simple, general way (in the form of Programming by Examples). The task involves generating a cascade of simple string rewrite programs to transform a given list of input strings into a list of desired output strings. We present a fully automated pipeline that programmatically generates problems of this type with controllable difficulty, enabling scalable evaluation of reasoning models while avoiding contamination. Using this approach, we construct two benchmarks: PBEBench-Lite, which efficiently stratifies models of varying capabilities, and PBEBench, which requires models to induce programs similar in complexity to those constructed by historical linguists. Our experiments reveal a substantial performance gap between models that leverage test-time compute or LCoT (long chain-of-thought) reasoning and those that do not. Moreover, although recent models show promise, the solve rate for both of them drops below 5% for hard instances of the PBEBench dataset (ground-truth cascade lengths of 20 and 30, respectively), falling well short of realistic historical linguistics requirements even with computationally expensive, popular scaling techniques from the PBE and reasoning literature. We also study the effectiveness of different scaling strategies and the impact of various hyperparameters on the difficulty of the generated data using gpt-oss-120b, the best-performing open-source model.
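To make the task concrete, the following is a minimal Python sketch of a rewrite cascade and a solution check. It is an illustration under stated assumptions, not the benchmark's actual implementation: each program is modeled here as a hypothetical `(source, target)` pair applied with `str.replace`, and the names `apply_cascade` and `solves` are invented for this example.

```python
from typing import List, Tuple

# Assumption: a "program" is modeled as a (source, target) substring rewrite.
# The benchmark's real program format may differ.
Program = Tuple[str, str]


def apply_cascade(cascade: List[Program], text: str) -> str:
    """Apply each rewrite in order, so later rewrites see earlier outputs."""
    for source, target in cascade:
        text = text.replace(source, target)
    return text


def solves(cascade: List[Program], inputs: List[str], outputs: List[str]) -> bool:
    """Return True if the cascade maps every input to its desired output."""
    return all(apply_cascade(cascade, i) == o for i, o in zip(inputs, outputs))


# Toy instance: the first rewrite creates new "ta" substrings that the
# second rewrite then consumes, so the order of the cascade matters.
inputs = ["pata", "tapa"]
outputs = ["dada", "dada"]
cascade = [("p", "t"), ("ta", "da")]

print(solves(cascade, inputs, outputs))                  # True
print(solves(list(reversed(cascade)), inputs, outputs))  # False
```

Reversing the two rewrites yields `tada` and `data` instead of the targets, which illustrates why inducing an ordered cascade is harder than inducing each rewrite in isolation.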
