Intrinsic Credit Assignment for Long Horizon Interaction

Ilze Amanda Auzina; Joschka Strüber; Sergio Hernández-Gutiérrez; Shashwat Goel; Ameya Prabhu; Matthias Bethge

TL;DR

This work addresses long-horizon uncertainty in open-ended information seeking by introducing ΔBelief-RL, which uses the agent's own belief updates as a dense intrinsic reward to credit intermediate actions. The method measures turn-wise belief changes as $\Delta\text{Belief}_t = \log b_t - \log b_{t-1}$, where $b_t$ is the probability the agent assigns to the target solution at turn $t$, and combines them with the verifiable end reward. This enables turn-level credit assignment via turn-wise GRPO without external reward models. Across 20 Questions and multiple model scales, ΔBelief-RL (CIA) outperforms training on purely outcome-based rewards in information-seeking performance, produces faster belief updates, and generalizes better to out-of-distribution tasks and practical applications, with performance continuing to scale as test-time interaction budgets grow. The approach improves interaction efficiency and opens pathways for scalable belief-calibrated learning in long-horizon, uncertain environments.
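The reward described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `eps` clamp, and in particular the way the dense reward is combined with the end reward (a weighted ΔBelief per turn plus the outcome added on the final turn) are assumptions, since the summary does not specify them.

```python
import math
from typing import List, Sequence


def delta_belief_rewards(beliefs: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Per-turn ΔBelief_t = log b_t - log b_{t-1} for t = 1..T.

    beliefs[0] is the belief before any interaction (b_0) and beliefs[t]
    is the belief after turn t. The eps clamp avoids log(0) and is an
    illustrative choice, not taken from the paper.
    """
    logs = [math.log(max(b, eps)) for b in beliefs]
    return [logs[t] - logs[t - 1] for t in range(1, len(logs))]


def combine_with_outcome(
    deltas: Sequence[float], outcome: float, weight: float = 1.0
) -> List[float]:
    """Hypothetical combination rule (assumption): scale each ΔBelief_t by
    `weight` and add the verifiable end reward on the final turn only."""
    rewards = [weight * d for d in deltas]
    if rewards:
        rewards[-1] += outcome
    return rewards


if __name__ == "__main__":
    # Toy trajectory: belief in the target rises from 0.1 to 0.2 to 0.8.
    deltas = delta_belief_rewards([0.1, 0.2, 0.8])
    print(deltas)
    print(combine_with_outcome(deltas, outcome=1.0, weight=0.5))
```

A turn that leaves the belief unchanged receives zero dense reward under this sketch, and a turn that lowers it receives a negative one.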

Abstract

How can we train agents to navigate uncertainty over long horizons? In this work, we propose ΔBelief-RL, which leverages a language model's own intrinsic beliefs to reward intermediate progress. Our method uses the change in the probability an agent assigns to the target solution for credit assignment. By training on synthetic interaction data, ΔBelief-RL teaches information-seeking capabilities that consistently outperform purely outcome-based rewards for Reinforcement Learning, with improvements generalizing to out-of-distribution applications ranging from customer service to personalization. Notably, performance continues to improve as we scale test-time interactions beyond the training horizon, with interaction efficiency increasing even on Pass@k metrics. Overall, our work introduces a scalable training strategy for navigating uncertainty over long horizons by enabling credit assignment to intermediate actions via intrinsic ΔBelief rewards.
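One way to see why this signal supports credit assignment is a short worked identity. It uses the TL;DR's definition $\Delta\text{Belief}_t = \log b_t - \log b_{t-1}$ and assumes $b_0$ denotes the belief before the first turn and $T$ the final turn (notation introduced here for illustration):

$$\sum_{t=1}^{T} \Delta\text{Belief}_t = \sum_{t=1}^{T} \left(\log b_t - \log b_{t-1}\right) = \log b_T - \log b_0$$

The per-turn rewards telescope, so they partition the total gain in log-belief over the episode among the individual turns: each turn is credited with exactly the share of the overall belief gain that it produced.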

Paper Structure (84 sections, 4 equations, 12 figures, 6 tables)

Introduction
Background and Problem Setup
Training Environment
Agent Configurations
$\Delta$Belief-RL: Intrinsic Credit Assignment
Agent Beliefs
$\Delta$Belief reward: Belief Change Signal
Validating the $\Delta$Belief Measurement
Do Agent Beliefs Reflect Interactive Progress?
Does Optimizing Belief Updates Improve Task Success?
Training with Reinforcement Learning
Turn-wise GRPO
Experimental Details
Data and Models
RL Training
...and 69 more sections

Figures (12)

Figure 1: Main contributions. We propose a dense reward signal, the $\Delta$Belief reward, based on the agent's intrinsic belief updates in long-horizon tasks. We find that $\Delta$Belief-RL (1) leads to more sample-efficient training; (2) yields agents that generalize better to unseen information-seeking tasks; and (3) scales better with increased test-time interaction budget.
Figure 2: Belief updates. Per-turn beliefs of Qwen3 (1.7B and 4B) about the ground truth, split by final outcome. The trajectories were generated by DeepSeek-v3.2. On average, beliefs steadily increase, and the rate of growth strongly correlates with the outcome of the trajectory.
Figure 3: Best-of-8 sampling with $\Delta\text{Belief}$. Success rate on the 20 Questions task for our baseline models Qwen3-1.7B and Qwen3-4B after SFT, as well as the base model Qwen3-8B. We compare regular question generation with $\Delta\text{Belief}$ sampling: we sample 8 questions at every turn and select the one that maximizes $\Delta\text{Belief}$. Across sizes, we observe a significant rise in performance when our signal is employed to guide the sampling of questions.
Figure 4: Training dynamics. Mean number of questions per episode (top) and mean fraction of repeated questions (bottom) during RL training; lower is better for both. For both Qwen3-1.7B and Qwen3-4B, $\Delta$Belief-RL reduces the number of turns required to solve the game and suppresses redundant queries more rapidly than standard GRPO (StarPO).
Figure 5: Belief-update dynamics. Normalized elicited log-probability of the correct concept, $\log p_\theta(y_i \mid h_t, e_i)$, as a function of the number of interactions for 1.7B models (top) and 4B models (bottom). At the 4B scale, our method CIA shows the largest and most sustained increase in belief updates over multiple interactions, while StarPO remains close to the SFT baseline. For 1.7B models, both trained variants track the baseline closely.
...and 7 more figures
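The best-of-8 selection described in the Figure 3 caption can be sketched as follows. This is a hedged illustration only: `select_by_delta_belief` and the `belief_after` callable are hypothetical names, and how the belief after a candidate question is actually elicited from the model is not specified in this summary, so it is passed in as a function here.

```python
import math
from typing import Callable, Sequence


def select_by_delta_belief(
    candidates: Sequence[str],
    prev_belief: float,
    belief_after: Callable[[str], float],
    eps: float = 1e-12,
) -> str:
    """Return the candidate question with the largest ΔBelief.

    prev_belief is b_{t-1}; belief_after(q) is assumed to return b_t, the
    probability assigned to the target once question q has been asked and
    answered. Both the callable and the eps clamp are illustrative.
    """
    log_prev = math.log(max(prev_belief, eps))

    def gain(question: str) -> float:
        return math.log(max(belief_after(question), eps)) - log_prev

    return max(candidates, key=gain)


if __name__ == "__main__":
    # Toy stand-in for the model's elicited beliefs (made-up numbers).
    toy_beliefs = {
        "Is it alive?": 0.30,
        "Is it blue?": 0.12,
        "Is it an animal?": 0.50,
    }
    best = select_by_delta_belief(
        list(toy_beliefs), prev_belief=0.10, belief_after=toy_beliefs.__getitem__
    )
    print(best)
```

Because all candidates at a turn share the same $b_{t-1}$, picking the largest ΔBelief is equivalent to picking the candidate with the highest resulting belief; the subtraction is kept so the sketch mirrors the definition.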
