SLR: Automated Synthesis for Scalable Logical Reasoning

Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting

TL;DR

The paper addresses scalable evaluation and training of inductive logical reasoning in LLMs. It introduces SLR, an automated framework that synthesizes tasks, latent ground-truth rules, and verifiable rewards under an ILP-inspired Learning-from-Entailment formalism. The pipeline has three stages: task specification, synthesis, and evaluation/training. Its benchmark, SLR-Bench, is a 19k-task, 20-level curriculum; evaluation on it shows that LLMs readily generate syntactically valid rules but often fail at correct logical inference. Curriculum-tuned models improve substantially on SLR-Bench and on downstream benchmarks at lower inference cost than reasoning LLMs. SLR thus offers a scalable, reproducible basis for evaluating and improving inductive logical reasoning, with potential relevance to scientific reasoning, program synthesis, and trustworthy AI.

Abstract

We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.

Paper Structure

This paper contains 48 sections, 3 equations, 6 figures, 16 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the SLR Framework, including task specification, automated task synthesis, training, and evaluation. Left (blue): Language defines vocabulary and grammar, Task Config specifies configuration parameters for the synthesis. Middle (green): The task synthesizer automatically generates ground-truth rules, validation programs, and instruction prompts. Right (purple): Training LLMs on logic tasks via SFT (cross-entropy) or RL (symbolic judge feedback). Right (orange): Evaluates LLMs using feedback provided by the symbolic judge. Arrows denote data and control flow through synthesis, prompting, evaluation, and downstream training loops.
  • Figure 2: Verifiable Logic Rewards: A candidate hypothesis is evaluated by executing it against the validation program. It outputs three metrics: syntactic validity (binary), perfect task completion (binary), and a partial score for the fraction of correctly classified examples.
  • Figure 3: Overview of SLR-Bench: The benchmark curriculum spans from basic to hard tasks with increasing logical and combinatorial complexity (bars, right y-axis). As logical complexity increases, model performance (red, left y-axis) declines, highlighting current LLMs' limitations on more challenging reasoning tasks.
  • Figure 4: Illustrative prompt and ground-truth rule generated by SLR (Level 1, SLR-Bench). Language ($\mathcal{L}$): 5 predicates, 1 car variable per train. Task configuration ($\Theta$): $\kappa = (1,1)$ (one positive and one negative example); $B_\pi =$ mirror; $R_{\text{len}} = 1$; $R_{\text{sample}} =$ uniform. The prompt provides the full ILP instance, including background $B$, positive/negative examples ($E^+, E^-$), and natural-language instructions for the learning task.
  • Figure 5: Compute–Performance Trade-Off. Reasoning LLMs achieve higher accuracy than base LLMs but require more compute. While more complex tasks typically demand more completion tokens, increased compute does not always translate to higher accuracy.
  • ...and 1 more figure
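
The three metrics named in the Figure 2 caption (syntactic validity, perfect task completion, and a partial score) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy rule syntax `attribute=value`, the `LogicReward` and `judge` names, and the representation of trains as lists of car dictionaries are hypothetical and are not SLR's rule language or validation program.

```python
import re
from dataclasses import dataclass

# Toy rule syntax (an assumption for this sketch, not SLR's rule language):
# "<attribute>=<value>" labels a train positive if any of its cars
# has that attribute set to that value.
RULE_PATTERN = re.compile(r"^(\w+)=(\w+)$")


@dataclass(frozen=True)
class LogicReward:
    syntax_valid: bool    # binary: did the candidate rule parse?
    perfect: bool         # binary: were all examples classified correctly?
    partial_score: float  # fraction of correctly classified examples


def judge(rule: str, positives: list, negatives: list) -> LogicReward:
    """Score a candidate rule against labeled positive and negative examples."""
    match = RULE_PATTERN.match(rule.strip())
    if match is None:
        # An unparsable rule earns no credit on any metric.
        return LogicReward(False, False, 0.0)
    attribute, value = match.groups()

    def covers(train: list) -> bool:
        return any(car.get(attribute) == value for car in train)

    # A positive example is correct if covered, a negative one if not covered.
    correct = sum(covers(t) for t in positives)
    correct += sum(not covers(t) for t in negatives)
    total = len(positives) + len(negatives)
    partial = correct / total if total else 0.0
    return LogicReward(True, total > 0 and correct == total, partial)


# One positive and one negative example, each a train with a single car.
positives = [[{"color": "red", "length": "short"}]]
negatives = [[{"color": "blue", "length": "short"}]]

print(judge("color=red", positives, negatives))
print(judge("length=short", positives, negatives))
print(judge("color is red", positives, negatives))
```

In this toy setting, `color=red` separates the two trains and is scored as perfect, `length=short` is syntactically valid but covers both trains and so earns only partial credit, and `color is red` fails the syntax check. The split between the second and third cases mirrors the distinction the abstract draws between producing syntactically valid rules and performing correct logical inference.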