
ProAct: Agentic Lookahead in Interactive Environments

Yangbin Yu, Mingyu Yang, Junyou Li, Yiming Gao, Feiyu Liu, Yijun Yang, Zichuan Lin, Jiafei Lyu, Yicheng Liu, Zhicong Lu, Deheng Ye, Jie Jiang

TL;DR

ProAct tackles the challenge of long-horizon planning in interactive environments by splitting lookahead into a grounded supervision phase and an online RL refinement phase. Grounded LookAhead Distillation (GLAD) uses environment-based MCTS to generate future trajectories and compresses them into concise, explicit reasoning chains for training, mitigating inference-time search costs. The Monte-Carlo Critic (MC-Critic) provides low-variance value signals via lightweight Monte Carlo rollouts, enabling stable policy optimization when combined with PPO/GRPO in MC-PPO/MC-GRPO. Experiments on 2048 and Sokoban show a 4B-parameter ProAct model surpassing open-source baselines and rivaling closed-source models, with robust generalization to unseen variants. Overall, ProAct offers a scalable, effective approach to internalizing lookahead and stabilizing multi-turn RL for large language model agents in complex environments.

Abstract

Existing Large Language Model (LLM) agents struggle in interactive environments requiring long-horizon planning, primarily due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm. First, we introduce Grounded LookAhead Distillation (GLAD), where the agent undergoes supervised fine-tuning on trajectories derived from environment-based search. By compressing complex search trees into concise, causal reasoning chains, the agent learns the logic of foresight without the computational overhead of inference-time search. Second, to further refine decision accuracy, we propose the Monte-Carlo Critic (MC-Critic), a plug-and-play auxiliary value estimator designed to enhance policy-gradient algorithms like PPO and GRPO. By leveraging lightweight environment rollouts to calibrate value estimates, MC-Critic provides a low-variance signal that facilitates stable policy optimization without relying on expensive model-based value approximation. Experiments on both stochastic (e.g., 2048) and deterministic (e.g., Sokoban) environments demonstrate that ProAct significantly improves planning accuracy. Notably, a 4B-parameter model trained with ProAct outperforms all open-source baselines and rivals state-of-the-art closed-source models, while demonstrating robust generalization to unseen environments. The code and models are available at https://github.com/GreatX3/ProAct

Paper Structure

This paper contains 44 sections, 18 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of ProAct. A two-stage paradigm to internalize accurate lookahead reasoning for AI agents. GLAD distills complex MCTS search trees into concise, causal reasoning chains via SFT. MC-Critic leverages lightweight environment rollouts to provide low-variance value estimates, stabilizing online RL training.
  • Figure 2: The overall framework of ProAct, which operates in two stages to internalize lookahead reasoning capabilities. Stage 1: Grounded LookAhead Distillation establishes the reasoning paradigm. It constructs high-quality lookahead search trees via environmental probing (MCTS) and distills these complex trajectories into compressed, explicit reasoning chains. Stage 2: Online Reinforcement Learning with the Monte-Carlo Critic refines the agent's lookahead reasoning accuracy using a policy-gradient framework (e.g., PPO or GRPO). We introduce a plug-and-play MC-Critic that provides low-variance value estimates by aggregating discounted returns from $M$ parallel trajectories generated by a lightweight random policy.
  • Figure 3: Case Study of GLAD on 2048. (a) Base model (Qwen3-4B-Instruct) and (b) GLAD-supervised model. Blue and red segments denote correct and incorrect intermediate analysis, respectively.
  • Figure 4: Comparison of baseline RL methods and MC-Critic when trained from GLAD SFT checkpoints on 2048 and Sokoban. MC-PPO and MC-GRPO are represented by "Step-PPO + MC-Critic" and "Step-GRPO + MC-Critic", respectively.
  • Figure 5: Comparison of baseline RL methods and MC-Critic when trained from scratch on 2048 and Sokoban. MC-PPO and MC-GRPO are represented by "Step-PPO + MC-Critic" and "Step-GRPO + MC-Critic", respectively.
  • ...and 3 more figures
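
The Figure 2 caption describes MC-Critic as aggregating discounted returns from $M$ trajectories generated by a lightweight random policy. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the aggregation is a plain mean, runs rollouts sequentially rather than in parallel, and uses a hypothetical `step_fn` environment interface, a hypothetical `horizon` cutoff, and a toy chain environment (`toy_step`) purely for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical environment interface (an assumption, not from the paper):
# step_fn(state, action, rng) -> (next_state, reward, done)
StepFn = Callable[[int, int, random.Random], Tuple[int, float, bool]]


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute sum_t gamma^t * r_t by folding from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mc_value_estimate(
    state: int,
    step_fn: StepFn,
    actions: Sequence[int],
    num_rollouts: int,  # plays the role of M in the Figure 2 caption
    horizon: int,       # assumed rollout length cap
    gamma: float,
    seed: int = 0,
) -> float:
    """Average the discounted returns of random-policy rollouts from `state`."""
    rng = random.Random(seed)
    returns: List[float] = []
    for _ in range(num_rollouts):
        s = state
        rewards: List[float] = []
        for _ in range(horizon):
            a = rng.choice(actions)  # lightweight uniform random policy
            s, r, done = step_fn(s, a, rng)
            rewards.append(r)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    # Assumed aggregation: arithmetic mean over the rollouts.
    return sum(returns) / len(returns)


def toy_step(state: int, action: int, rng: random.Random) -> Tuple[int, float, bool]:
    """Toy chain: every step pays reward 1; the episode ends at state 3."""
    nxt = state + 1
    return nxt, 1.0, nxt >= 3


value = mc_value_estimate(0, toy_step, actions=[0, 1], num_rollouts=4, horizon=10, gamma=0.5)
print(value)
```

In a policy-gradient setup, an estimate like `value` could serve as a baseline in place of a learned critic; how ProAct combines it with PPO or GRPO is specified in the paper, not in this sketch.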