CARL: Critical Action Focused Reinforcement Learning for Multi-Step Agent

Leyang Shen, Yang Zhang, Chun Kai Ling, Xiaoyan Zhao, Tat-Seng Chua

TL;DR

The paper addresses inefficiency in training multi-step agents on knowledge-intensive tasks by challenging the assumption that all actions contribute equally to the outcome. It introduces CARL, a critical-action-focused reinforcement learning framework that identifies high-criticality actions via state entropy, then concentrates rollouts, action-level rewards, and model updates on those actions through entropy-guided progressive rollout and selective updates. Experiments on both reasoning and non-reasoning models show that CARL improves performance while substantially reducing training and inference costs relative to group-level policy optimization (GRPO). CARL also maintains higher policy entropy than GRPO, which indicates stronger exploration, and its efficiency gains on knowledge-intensive QA benchmarks suggest practical value for multi-turn agents in real-world settings.

Abstract

Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each action contributes equally, which deviates significantly from reality. Our analysis reveals that only a small fraction of actions are critical in determining the final outcome. Building on this insight, we propose CARL, a critical-action-focused reinforcement learning algorithm tailored for multi-step agents. CARL achieves focused training by providing action-level optimization signals for high-criticality actions while excluding low-criticality actions from the model update. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency during training and inference across diverse evaluation settings.
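
The idea of singling out a small set of critical actions can be illustrated with a minimal sketch. It assumes, purely for illustration, that criticality is approximated by the Shannon entropy of a per-step probability distribution over candidate actions; the function names, the `top_fraction` parameter, and the toy distributions are hypothetical and are not taken from the paper.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_critical_steps(step_distributions, top_fraction=0.25):
    """Return the indices of the highest-entropy steps, in trajectory order.

    `top_fraction` is an illustrative knob (an assumption of this sketch),
    not a hyperparameter reported by the paper.
    """
    entropies = [shannon_entropy(p) for p in step_distributions]
    k = max(1, round(top_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy trajectory: one hypothetical action distribution per step.
trajectory = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic step: low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform step: maximal entropy
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],
]
critical = select_critical_steps(trajectory, top_fraction=0.5)
print(critical)
```

In this toy example, the uniform step and the broadly spread final step are selected, while the near-deterministic steps would be skipped during rollout forking and model updates.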

Paper Structure

This paper contains 26 sections, 13 equations, 4 figures, 4 tables, and 1 algorithm.

Figures (4)

  • Figure 1: CARL performs focused reinforcement learning on actions with high criticality. Therefore, it delivers higher performance with lower training and inference costs than GRPO.
  • Figure 2: Quantitative Analysis of Execution Pipeline. (a) Most actions yield low reward variance when resampled, while only a small subset exhibits notably high variance. (b) The states corresponding to high-criticality actions show higher entropy than those associated with low-criticality actions.
  • Figure 3: CARL Algorithm. In the rollout phase, CARL progressively forks critical actions and provides action-level guidance for them through an expected-reward-gain formulation. The expected reward of each state is estimated by averaging over its successor states, and the advantage of an action is computed as the difference between the expected rewards of its terminal state and its initial state. In the update phase, low-criticality actions are excluded to further improve efficiency.
  • Figure 4: Comparison of Entropy between CARL and GRPO. CARL maintains consistently higher entropy than GRPO during training and evaluation, indicating stronger exploration capability.
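
The expected-reward-gain formulation described in the Figure 3 caption can be sketched on a small rollout tree. This is one reading of the caption, not the authors' implementation: the `Node` class, the function names, and the toy rewards are hypothetical, and terminal states are assumed to carry a scalar outcome reward.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A state in the rollout tree (hypothetical structure for this sketch)."""
    reward: Optional[float] = None  # outcome reward, set only on terminal states
    children: List["Node"] = field(default_factory=list)


def expected_reward(node: Node) -> float:
    """Estimate a state's expected reward by averaging over its successor states."""
    if not node.children:
        return node.reward  # terminal state: use the observed outcome reward
    return sum(expected_reward(c) for c in node.children) / len(node.children)


def action_advantages(node: Node) -> List[float]:
    """Advantage of each action (edge) in depth-first order.

    Advantage = expected reward of the state the action leads to
                minus expected reward of the state it was taken from.
    """
    v_before = expected_reward(node)
    advantages = []
    for child in node.children:
        advantages.append(expected_reward(child) - v_before)
        advantages.extend(action_advantages(child))
    return advantages


# Toy tree: the root is forked into two actions; the first leads to a state
# that is forked again into a successful and a failed terminal state.
forked_state = Node(children=[Node(reward=1.0), Node(reward=0.0)])
root = Node(children=[forked_state, Node(reward=0.0)])

advantages = action_advantages(root)
print(expected_reward(root), advantages)
```

Under this reading, a state with a single successor has the same expected reward as that successor, so an action that is never forked receives zero advantage. That is consistent with such actions contributing nothing to the update.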