a1: Steep Test-time Scaling Law via Environment Augmented Generation

Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Yuyao Ge, Jun Wan, Yurong Wu, Xueqi Cheng

TL;DR

Environment Augmented Generation (EAG) integrates real-time environmental feedback, dynamic branch exploration, and trajectory-based learning to address hallucinations and errors in multi-step reasoning. By formulating reasoning as an interactive process within an MDP $(\mathcal{S}, \mathcal{A}, \mathcal{F}, \mathcal{T}, \mathcal{R})$ and training on the EAG-2K dataset of validated trajectories, EAG enables deliberate backtracking and plan refinement via external validation. Empirically, a1-32B achieves state-of-the-art results among 32B models across reasoning benchmarks, matching larger models such as o1 on competition mathematics and outperforming comparable models by up to 24.4 percentage points. It also reveals a steep scaling pattern: the initial token cost of environment interaction is outweighed by long-term performance dividends as task complexity grows. These findings offer a practical, parameter-efficient pathway to reliable machine reasoning by embedding feedback-driven exploration into the generation process.

Abstract

Large Language Models (LLMs) have made remarkable breakthroughs in reasoning, yet continue to struggle with hallucinations, logical errors, and inability to self-correct during complex multi-step tasks. Current approaches like chain-of-thought prompting offer limited reasoning capabilities that fail when precise step validation is required. We propose Environment Augmented Generation (EAG), a framework that enhances LLM reasoning through: (1) real-time environmental feedback validating each reasoning step, (2) dynamic branch exploration for investigating alternative solution paths when faced with errors, and (3) experience-based learning from successful reasoning trajectories. Unlike existing methods, EAG enables deliberate backtracking and strategic replanning through tight integration of execution feedback with branching exploration. Our a1-32B model achieves state-of-the-art performance among similar-sized models across all benchmarks, matching larger models like o1 on competition mathematics while outperforming comparable models by up to 24.4 percentage points. Analysis reveals EAG's distinctive scaling pattern: initial token investment in environment interaction yields substantial long-term performance dividends, with advantages amplifying proportionally to task complexity. EAG's theoretical framework demonstrates how environment interactivity and systematic branch exploration together establish a new paradigm for reliable machine reasoning, particularly for problems requiring precise multi-step calculation and logical verification.
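
The interplay of step validation and branch exploration described above can be illustrated with a toy sketch. It is not the authors' implementation: the function names `environment_check` and `solve_with_feedback`, the fixed list of proposals standing in for a model, and the character-counting task are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def environment_check(word: str, letter: str, claimed: int) -> Tuple[bool, str]:
    """Hypothetical environment: executes the count and returns feedback."""
    actual = word.count(letter)
    if claimed == actual:
        return True, "ok"
    return False, f"execution returned {actual}, claim was {claimed}"


def solve_with_feedback(
    word: str, letter: str, proposals: Iterable[int]
) -> Tuple[Optional[int], List[Tuple[int, str]]]:
    """Try each proposed answer (one branch per proposal) until one validates.

    `proposals` stands in for a model's candidate answers; a real system
    would generate the next branch conditioned on the feedback so far.
    """
    trajectory: List[Tuple[int, str]] = []
    for claimed in proposals:
        ok, feedback = environment_check(word, letter, claimed)
        trajectory.append((claimed, feedback))  # record every attempt
        if ok:
            return claimed, trajectory  # validated branch: stop exploring
    return None, trajectory  # all branches failed validation


# First branch guesses 2, the environment rejects it, second branch succeeds.
answer, trajectory = solve_with_feedback("strawberry", "r", [2, 3])
print(answer, trajectory)
```

The recorded `trajectory` keeps failed attempts alongside the validated one, which is the kind of record that trajectory-based learning would draw on.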

Paper Structure

This paper contains 41 sections, 4 theorems, 28 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Under Lipschitz continuity of information gain $\mathcal{I}$ and proper metric learning rate $\eta$, the EAG process converges to an $\epsilon$-optimal solution within $O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ steps with probability $1-\delta$.
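
To get a feel for how this bound scales, the sketch below evaluates $c \cdot \frac{1}{\epsilon^2}\log\frac{1}{\delta}$ for a few settings. The constant `c` is an assumption: the big-O statement hides it, so only the ratios between values are meaningful, not the absolute step counts. The function name `step_bound` is hypothetical.

```python
import math


def step_bound(epsilon: float, delta: float, c: float = 1.0) -> float:
    """Evaluate c * (1 / epsilon^2) * log(1 / delta).

    `c` is an assumed placeholder for the constant hidden by the O(.) notation.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    return c * math.log(1.0 / delta) / epsilon ** 2


base = step_bound(epsilon=0.1, delta=0.01)
tighter_eps = step_bound(epsilon=0.05, delta=0.01)   # halve epsilon
tighter_delta = step_bound(epsilon=0.1, delta=1e-4)  # square delta

print(tighter_eps / base)    # ratio from halving epsilon
print(tighter_delta / base)  # ratio from squaring delta
```

Halving $\epsilon$ multiplies the bound by four, while squaring $\delta$ only doubles it, so under this bound accuracy is far more expensive than confidence.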

Figures (6)

  • Figure 1: Illustration of the Environment Augmented Generation (EAG) framework solving a character counting task. The model explores multiple solution paths with instant feedback.
  • Figure 2: Model performance on MATH500 benchmark versus training data size. Dashed lines show scaling trends for a1. Our a1-32B achieves superior performance with fewer training examples compared to baseline models.
  • Figure 3: EAG framework. Left: branched state transition graph showing model navigation through states ($s_0, s_1, \ldots$) with information gain-guided decisions ($g > \tau$). Right: environmental interfaces providing real-time feedback ($\mathcal{E}$) for step validation. Green checkmarks and red crosses indicate successful and failed paths respectively.
  • Figure 4: Token length distribution analysis between s1K and EAG2K datasets. The violin plots (right) show the overall distribution shapes and ranges, with EAG2K exhibiting a higher median length and wider spread. The density plots (left) highlight the shift towards longer sequences in EAG2K, with peaks at approximately 6000 and 8000 tokens for s1K and EAG2K respectively.
  • Figure 5: Example of an iterative refinement cycle with execution, feedback, and correction.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Theorem 1: EAG Convergence
  • proof
  • Theorem 2: Linear Retry Approximation
  • proof
  • Theorem 3: EAG Convergence
  • proof
  • Theorem 4: Linear Retry Approximation
  • proof