SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Dengjia Zhang; Xiaoou Liu; Lu Cheng; Yaqing Wang; Kenton Murray; Hua Wei

TL;DR

SELAUR (Self Evolving LLM Agent via Uncertainty-aware Rewards) is a reinforcement learning framework that incorporates LLM uncertainty directly into the reward design. Its failure-aware reward reshaping mechanism injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability.

Abstract

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.
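
As a rough illustration of the combined token-level estimate described in the abstract, the sketch below computes entropy-, least-confidence-, and margin-based scores from a single next-token probability distribution and averages them. The function name token_uncertainty, the equal weights, and the normalization of each score to [0, 1] are assumptions made for this example; the abstract does not give the paper's exact formulas.

```python
import math


def token_uncertainty(probs, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three uncertainty scores for one token into a value in [0, 1].

    `probs` is a next-token probability distribution (at least 2 entries).
    The equal default weights are an assumption, not the paper's setting.
    """
    if len(probs) < 2:
        raise ValueError("need at least two candidate tokens")
    total = sum(probs)
    p = [x / total for x in probs]  # renormalize defensively

    # Entropy, divided by log(vocabulary size) so it lies in [0, 1].
    entropy = -sum(x * math.log(x) for x in p if x > 0) / math.log(len(p))

    top = sorted(p, reverse=True)
    # Least confidence: one minus the probability of the most likely token.
    least_confidence = 1.0 - top[0]
    # Margin: a small gap between the top two tokens means high uncertainty.
    margin_uncertainty = 1.0 - (top[0] - top[1])

    w_e, w_l, w_m = weights
    return w_e * entropy + w_l * least_confidence + w_m * margin_uncertainty
```

Under these assumptions, a one-hot distribution scores zero and a uniform distribution scores close to one, so the value can serve as a dense per-token signal.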

Paper Structure (22 sections, 3 equations, 3 figures, 3 tables)

Introduction
Related Work
Method
Uncertainty Estimation
Entropy.
Least confidence.
Margin.
Aggregated Uncertainty.
Step and Trajectory Aggregation
Step-level aggregation.
Trajectory-level aggregation.
Failure-aware Reward Shaping
Step-wise shaping.
Trajectory-level shaping.
Experiments
...and 7 more sections

Figures (3)

Figure 1: Overview of our method, SELAUR. (a) Uncertainty Estimation: We estimate model uncertainty with three metrics derived from token probability distributions: entropy, least confidence, and margin. (b) Step and Trajectory Aggregation: Over the course of a trajectory, we compute uncertainty at two levels, both of which contribute to the uncertainty reward. At the step level, token-level uncertainties are aggregated. At the trajectory level, step-level uncertainties are combined with varying weights, where later steps receive higher weights in the overall trajectory reward. (c) Failure-aware Reward Reshaping: For successful cases, the model is trained with the standard reward. For failed cases, we incorporate the uncertainty reward into the training loop to improve learning robustness.
Figure 2: Entropy change during training under different uncertainty strategies. SELAUR maintains higher entropy throughout training, indicating stronger exploration compared to other methods.
Figure 3: Comparison of action traces in the WebShop environment. Top: SELAUR leverages uncertainty to explore alternative paths rather than being trapped in a single incorrect trajectory, eventually discovering the correct solution. Bottom: GiGPO exhibits low uncertainty, leading to repetitive and fixed behaviors.
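
The aggregation and reshaping stages in Figure 1 (b) and (c) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the mean over tokens, the linearly increasing step weights, the additive form of the shaping, and the coefficients alpha and beta are all assumptions, as are the function names.

```python
def step_uncertainty(token_uncertainties):
    """Aggregate token-level uncertainties of one step (assumed: plain mean)."""
    return sum(token_uncertainties) / len(token_uncertainties)


def trajectory_uncertainty(step_uncertainties):
    """Weighted average over steps, with later steps weighted more heavily.

    Assumed weighting: step t (1-indexed) gets weight t.
    """
    weights = range(1, len(step_uncertainties) + 1)
    weighted = sum(w * u for w, u in zip(weights, step_uncertainties))
    return weighted / sum(weights)


def shape_rewards(step_rewards, traj_reward, step_uncertainties, success,
                  alpha=0.1, beta=0.1):
    """Failure-aware shaping: only failed trajectories get uncertainty bonuses.

    alpha and beta are hypothetical coefficients chosen for illustration.
    Returns (shaped_step_rewards, shaped_trajectory_reward).
    """
    if success:
        # Successful trajectories keep the standard reward unchanged.
        return list(step_rewards), traj_reward
    shaped_steps = [r + alpha * u
                    for r, u in zip(step_rewards, step_uncertainties)]
    shaped_traj = traj_reward + beta * trajectory_uncertainty(step_uncertainties)
    return shaped_steps, shaped_traj
```

In this sketch a failed trajectory with all-zero environment rewards still yields a non-zero learning signal, which is the motivation the caption gives for using uncertainty in failed cases.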
