SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei

TL;DR

SELAUR (Self Evolving LLM Agent via Uncertainty-aware Rewards) is a reinforcement learning framework that incorporates uncertainty directly into the reward design. A failure-aware reward reshaping mechanism injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability.

Abstract

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.
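The three token-level metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas or how the metrics are combined, so everything below is an assumption: the textbook definitions of entropy, least confidence, and margin, entropy normalized by the log of the vocabulary size, and an equal-weight average. The function name `token_uncertainty` and the `weights` parameter are hypothetical.

```python
import math


def token_uncertainty(probs, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three uncertainty metrics for one token's probability distribution.

    Assumed definitions (not taken from the paper):
      - entropy: Shannon entropy, normalized to [0, 1] by log(len(probs))
      - least confidence: 1 - max probability
      - margin: 1 - (top-1 probability - top-2 probability)
    The weighted average is likewise an illustrative choice.
    """
    if len(probs) < 2:
        raise ValueError("need at least two candidate tokens")

    # Shannon entropy; zero-probability entries contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    norm_entropy = entropy / math.log(len(probs))

    # Top-1 and top-2 probabilities drive the other two metrics.
    top1, top2 = sorted(probs, reverse=True)[:2]
    least_confidence = 1.0 - top1
    margin = 1.0 - (top1 - top2)

    w_entropy, w_conf, w_margin = weights
    return w_entropy * norm_entropy + w_conf * least_confidence + w_margin * margin
```

Under these assumptions, a one-hot distribution scores 0 on all three metrics, while a uniform distribution scores close to the maximum, so the combined value rises as the model becomes less certain about the next token.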

Paper Structure

This paper contains 22 sections, 3 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Overview of our method, SELAUR. (a) Uncertainty Estimation: We employ three approaches to estimate model uncertainty: entropy, confidence, and margin, derived from token probability distributions. (b) Step and Trajectory Aggregation: Throughout the model’s process, we compute two types of rewards that contribute to the uncertainty reward. At the step level, token-level uncertainties are aggregated. At the trajectory level, step-level uncertainties are combined with varying weights, where later steps receive higher weights in the overall trajectory reward. (c) Failure-aware Reward Reshaping: For successful cases, the model is trained with the standard reward. For failed cases, we incorporate the uncertainty reward into the training loop to improve learning robustness.
  • Figure 2: Entropy change during training under different uncertainty strategies. SELAUR maintains higher entropy throughout training, indicating stronger exploration compared to other methods.
  • Figure 3: Comparison of action traces in the WebShop environment. Top: SELAUR leverages uncertainty to explore alternative paths rather than being trapped in a single incorrect trajectory, eventually discovering the correct solution. Bottom: GiGPO exhibits low uncertainty, leading to repetitive and fixed behaviors.
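
The aggregation and reshaping stages described in the Figure 1 caption can be sketched as follows. Only the overall shape comes from the caption: token uncertainties are aggregated per step, later steps receive higher weights at the trajectory level, and the uncertainty reward is used only for failed trajectories. The specifics are assumptions made for illustration: a plain mean over tokens, linearly increasing step weights, an additive bonus, and the hypothetical coefficient `coef`. All function names are hypothetical as well.

```python
def step_uncertainty(token_uncertainties):
    """Aggregate token-level uncertainties into one step-level value.

    Assumption: a plain mean; the caption only says they are aggregated.
    """
    return sum(token_uncertainties) / len(token_uncertainties)


def trajectory_uncertainty(step_uncertainties):
    """Combine step-level uncertainties, weighting later steps more heavily.

    Assumption: weights grow linearly (1, 2, ..., T) and are normalized to sum to 1.
    """
    num_steps = len(step_uncertainties)
    weight_total = num_steps * (num_steps + 1) / 2
    return sum(
        (index + 1) / weight_total * value
        for index, value in enumerate(step_uncertainties)
    )


def reshape_reward(env_reward, success, uncertainty, coef=0.1):
    """Failure-aware reshaping: only failed trajectories get the uncertainty term.

    Assumption: the uncertainty reward is added with a fixed coefficient.
    """
    if success:
        return env_reward  # successful case: standard reward, unchanged
    return env_reward + coef * uncertainty


# Usage: a failed two-step trajectory with made-up token uncertainties.
steps = [step_uncertainty([0.2, 0.4]), step_uncertainty([0.8, 0.6])]
shaped = reshape_reward(0.0, success=False, uncertainty=trajectory_uncertainty(steps))
```

With linear weights, an uncertain final step contributes more to the trajectory value than an equally uncertain first step, which matches the caption's statement that later steps receive higher weights.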