Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning

Bowen Liu, Zhi Wu, Runquan Xie, Zhanhui Kang, Jia Li

TL;DR

SSLogic is an agentic meta-synthesis framework that scales at the task-family level by iteratively synthesizing and repairing executable Generator–Validator program pairs in a closed Generate–Validate–Repair loop, enabling continuous family evolution with controllable difficulty.

Abstract

Scaling verifiable training signals remains a key bottleneck for Reinforcement Learning from Verifiable Rewards (RLVR). Logical reasoning is a natural substrate: constraints are formal and answers are programmatically checkable. However, prior synthesis pipelines either depend on expert-written code or operate within fixed templates/skeletons, which limits growth largely to instance-level perturbations. We propose SSLogic, an agentic meta-synthesis framework that scales at the task-family level by iteratively synthesizing and repairing executable Generator–Validator program pairs in a closed Generate–Validate–Repair loop, enabling continuous family evolution with controllable difficulty. To ensure reliability, we introduce a Multi-Gate Validation Protocol that combines multi-strategy consistency checks with Adversarial Blind Review, where independent agents must solve instances by writing and executing code to filter ambiguous or ill-posed tasks. Starting from 400 seed families, two evolution rounds expand to 953 families and 21,389 verifiable instances (from 5,718). Training on SSLogic-evolved data yields consistent gains over the seed baseline at matched training steps, improving SynLogic by +5.2, BBEH by +1.4, AIME25 by +3.0, and Brumo25 by +3.7.
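
To make the idea of an executable Generator/Validator pair inside a closed Generate, Validate, Repair loop concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the subset-sum task family, the `Family` container, the `passes_gates` check, and the `repair` callback are all hypothetical names introduced here. In SSLogic the synthesis and repair steps are performed by agents, and the blind review is done by independent agents that write and run their own code; in this toy version a brute-force solver stands in for that reviewer.

```python
import random
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Optional


@dataclass
class Instance:
    numbers: list[int]   # pool of integers shown to the solver
    target: int          # required sum


def generate(seed: int, difficulty: int) -> Instance:
    """Toy Generator: a subset-sum puzzle whose pool grows with difficulty."""
    rng = random.Random(seed)
    numbers = [rng.randint(1, 20) for _ in range(difficulty + 2)]
    chosen = [n for n in numbers if rng.random() < 0.5] or numbers[:1]
    return Instance(numbers=numbers, target=sum(chosen))


def validate(inst: Instance, answer: list[int]) -> bool:
    """Toy Validator: answer must be drawn from the pool and hit the target."""
    pool = list(inst.numbers)
    for a in answer:
        if a not in pool:
            return False
        pool.remove(a)  # each pool element may be used at most once
    return sum(answer) == inst.target


def brute_force_solve(inst: Instance) -> Optional[list[int]]:
    """Independent solver (stand-in for a blind reviewer that runs code)."""
    for r in range(1, len(inst.numbers) + 1):
        for combo in combinations(inst.numbers, r):
            if sum(combo) == inst.target:
                return list(combo)
    return None


@dataclass
class Family:
    generate: Callable[[int, int], Instance]
    validate: Callable[[Instance, list[int]], bool]


def passes_gates(family: Family, difficulty: int, n_samples: int = 20) -> bool:
    """Hypothetical gate: instances must be solvable, the validator must accept
    an independently found answer, and it must reject a corrupted one."""
    for seed in range(n_samples):
        inst = family.generate(seed, difficulty)
        answer = brute_force_solve(inst)
        if answer is None:
            return False                      # ill-posed: no solution exists
        if not family.validate(inst, answer):
            return False                      # validator rejects a correct answer
        if family.validate(inst, answer + [1]):
            return False                      # validator accepts a wrong answer
    return True


def generate_validate_repair(
    family: Family,
    repair: Callable[[Family], Family],
    difficulty: int,
    max_rounds: int = 3,
) -> Optional[Family]:
    """Closed loop: keep a family only once it passes the gates; otherwise
    hand it to `repair` (an agent in the paper, a plain callback here)."""
    for _ in range(max_rounds):
        if passes_gates(family, difficulty):
            return family
        family = repair(family)
    return None


# Usage: a family with a broken validator is caught by the gate and repaired.
buggy = Family(generate, lambda inst, answer: True)
fixed = generate_validate_repair(buggy, lambda f: Family(f.generate, validate), 5)
```

In this sketch the `difficulty` argument only controls the pool size, which is one simple way to expose a difficulty knob; the paper's families may parameterize difficulty differently.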

Paper Structure (80 sections, 4 equations, 27 figures, 17 tables)


Figures (27)

  • Figure 1: Paradigm Shifts in Logic Data Generation: From Manual Curation to Agentic Meta-Synthesis. Left: Traditional Manual Curation focuses on Task/QA pairs, where quality control and feedback rely heavily on humans. Middle: Code Synthesis introduces executable Generators/Validators, achieving partial automation but still requiring manual oversight. Right: Our Agentic Meta-Synthesis enables fully automatic, end-to-end data production. Agents iteratively generate and validate task families (Generator + Validator) and instances, realizing the path from Manual $\rightarrow$ Semi-Automatic $\rightarrow$ Full-Automatic construction (Scaling the Scaling Logic).
  • Figure 2: Overview of the Multi-Gate Agentic Meta-Synthesis Framework. The Main Agent operates in a three-phase closed loop: Task Synthesis (Phase I), screening via Quality Agent Gates and Consensus-based Validation (including Blind Review) (Phase II), and Abductive Debugging for failures with Experience Updates, finally delivering Generators/Validators, templates, and data (Phase III).
  • Figure 3: Evolution of reflection-like token frequency across different training settings.
  • Figure 4: Average response length dynamics during training.
  • Figure 5: Difficulty controllability. Pass@1 accuracy on Seed versus Evolved tasks at $D \in \{5, 7, 10\}$ for DeepSeek-V3.1-Terminus and Doubao-1.6-Thinking. The curves decrease monotonically and closely track each other, with error bars shown.
  • ...and 22 more figures