Agentic Critical Training

Weize Liu; Minghui Liu; Sy-Tuyen Ho; Souradip Chakraborty; Xiyao Wang; Furong Huang

TL;DR

Agentic Critical Training (ACT) trains LLM agents with reinforcement learning to identify the better action among alternatives. It enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, suggesting a promising path toward more reflective and capable LLM agents.

Abstract

Training large language models (LLMs) as autonomous agents often begins with imitation learning, which teaches agents what to do but not why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods, achieving an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.
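The idea of rewarding only whether the model's judgment is correct can be illustrated with a minimal sketch. The answer format (`Answer: A` or `Answer: B`), the function names, and the binary 0/1 reward values are assumptions made for illustration; this excerpt does not specify the paper's actual reward design.

```python
import re

# Hypothetical output format: the response ends with "Answer: A" or "Answer: B".
CHOICE_PATTERN = re.compile(r"Answer:\s*([AB])\b")


def parse_choice(response):
    """Return the last 'A'/'B' choice stated in the response, or None."""
    matches = CHOICE_PATTERN.findall(response)
    return matches[-1] if matches else None


def selection_reward(response, correct_label):
    """Binary reward: 1.0 if the chosen action is the better one, else 0.0.

    The free-form reasoning before the answer is never scored directly,
    so any reasoning the model produces is shaped only by the outcome.
    """
    return 1.0 if parse_choice(response) == correct_label else 0.0
```

Because the reward depends only on the final selection, a response with no parseable choice receives 0.0 under this sketch, the same as a wrong choice.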

Paper Structure (45 sections, 8 equations, 13 figures, 5 tables, 1 algorithm)

Introduction
Agentic Critical Training
Problem Formulation
Data Construction
Training Pipeline
Agentic Critical Training.
RL Action Training.
Reward Design.
Related Work
LLM-based Agents.
Training LLM Agents.
Critique RL Training.
Agentic RL.
Experiments
Experimental Setup
...and 30 more sections

Figures (13)

Figure 1: Comparison of imitated vs. genuine self-reflection. (a) Early Experience executes both actions in the environment, generates a reflection from the resulting states, and trains the model to imitate this fixed text via supervised fine-tuning (SFT). (b) ACT presents two candidate actions and trains the model via RL to select the better one. Since only the selection outcome is rewarded, the model must autonomously develop reasoning about action quality to maximize reward.
Figure 2: Overview of the ACT + RL training pipeline. Stage 1 (Data Construction): Given expert demonstration trajectories, we extract state-action pairs and sample alternative actions from the initial policy $\pi_{\theta_0}$ at each state. Expert actions are paired with model-generated alternatives to construct contrastive training examples. Stage 2 (Agentic Critical Training): The model is trained via GRPO to identify the better action among candidates presented in randomized order, internalizing an understanding of action quality through verifiable rewards. Stage 3 (RL Action Training): The ACT-enhanced model is further trained with RL for direct action generation, leveraging its improved critical reasoning foundation to achieve higher task success rates.
Figure 3: Failure recovery on ALFWorld. Left: The IL model enters an infinite loop, repeating a failed action for over 30 steps until termination. Right: The ACT model encounters the same type of failure but uses its internal reasoning to diagnose the root cause (wrong location), break the loop, and issue the correct navigation command.
Figure 4: Self-verification behavior observed in ACT on GPQA-Diamond. After deriving the kinetic energies, the ACT model substitutes each answer option back into the energy conservation equation, eliminating inconsistent options.
Figure 5: The ACT prompt for ALFWorld. The model is presented with the full context followed by two candidate actions and is asked to select the better one with reasoning.
...and 8 more figures
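
The data construction and ACT stages described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the prompt wording, the `build_example` and `group_relative_advantages` helpers, and the use of a population standard deviation with a small epsilon are assumptions. The advantage computation follows the commonly described GRPO recipe of normalizing each sampled response's reward against the other responses for the same prompt.

```python
import random
import statistics


def build_example(state, expert_action, alternative_action, rng):
    """Pair an expert action with a sampled alternative in randomized order.

    Randomizing the order keeps the position (A or B) from leaking
    which candidate is the expert action.
    """
    expert_first = rng.random() < 0.5
    if expert_first:
        first, second = expert_action, alternative_action
    else:
        first, second = alternative_action, expert_action
    prompt = (
        f"{state}\n"
        f"Action A: {first}\n"
        f"Action B: {second}\n"
        "Which action is better? Explain, then finish with 'Answer: A' or 'Answer: B'."
    )
    return {"prompt": prompt, "label": "A" if expert_first else "B"}


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group of four responses with rewards `[1.0, 0.0, 1.0, 0.0]` yields positive advantages for the correct selections and negative ones for the incorrect selections, while a group in which every response is correct yields all-zero advantages and therefore no learning signal from that prompt.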
