SLR: Automated Synthesis for Scalable Logical Reasoning
Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting
TL;DR
The paper addresses scalable evaluation and training of inductive logical reasoning in LLMs. It introduces SLR, an automated framework that synthesizes tasks, latent ground-truth rules, and verifiable rewards under a Learning-from-Entailment formalism inspired by Inductive Logic Programming (ILP). The pipeline has three stages: task specification, synthesis, and evaluation/training. The resulting benchmark, SLR-Bench, contains 19k tasks arranged in a 20-level curriculum; it shows that LLMs readily produce syntactically valid rules but often fail at the logical inference needed to make those rules correct. Curriculum training with SLR doubles Llama-3-8B accuracy on SLR-Bench, reaching parity with Gemini-Flash-Thinking at a fraction of the computational cost, and the gains carry over to established reasoning benchmarks. Overall, SLR offers a scalable, reproducible foundation for evaluating and improving inductive logical reasoning, with potential implications for scientific reasoning, program synthesis, and trustworthy AI.
Abstract
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of the computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
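To make the idea of a validation program that yields verifiable rewards concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the SLR implementation: the abstract does not specify how rules are represented or executed, so rules are modeled here as plain Python predicates, and the toy attributes (`color`, `wheels`) and the function name `verifiable_reward` are hypothetical. The check follows the usual learning-from-entailment criterion: a candidate rule is accepted only if it covers every positive example and no negative one.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy representation: an example is a flat dict of attributes,
# and a rule is any predicate over such a dict.
Example = Dict[str, object]
Rule = Callable[[Example], bool]


def verifiable_reward(rule: Rule, labeled: List[Tuple[Example, bool]]) -> Tuple[float, bool]:
    """Score a candidate rule against labeled examples.

    Returns (accuracy, solved), where `solved` is True only if the rule
    classifies every positive and every negative example correctly.
    """
    if not labeled:
        raise ValueError("need at least one labeled example")
    correct = 0
    for example, label in labeled:
        try:
            predicted = bool(rule(example))
        except Exception:
            predicted = None  # a rule that crashes counts as a wrong answer
        if predicted == label:
            correct += 1
    return correct / len(labeled), correct == len(labeled)


# Assumed latent ground-truth rule, used only to label the synthetic examples.
latent_rule: Rule = lambda ex: ex["color"] == "red" and ex["wheels"] >= 3

examples: List[Example] = [
    {"color": "red", "wheels": 3},
    {"color": "red", "wheels": 2},
    {"color": "blue", "wheels": 4},
    {"color": "red", "wheels": 4},
]
labeled = [(ex, latent_rule(ex)) for ex in examples]

# A well-formed candidate that is only partially correct: it ignores the wheel count.
candidate: Rule = lambda ex: ex["color"] == "red"

print(verifiable_reward(candidate, labeled))    # (0.75, False)
print(verifiable_reward(latent_rule, labeled))  # (1.0, True)
```

The candidate above is syntactically valid yet misclassifies one negative example, which mirrors the gap the abstract reports between producing valid rules and performing correct inference. Returning both a graded score and a strict pass/fail flag is a design choice of this sketch; whether SLR uses one, the other, or both is not stated here.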
