Intrinsic Credit Assignment for Long Horizon Interaction
Ilze Amanda Auzina, Joschka Strüber, Sergio Hernández-Gutiérrez, Shashwat Goel, Ameya Prabhu, Matthias Bethge
TL;DR
This work addresses credit assignment under long-horizon uncertainty in open-ended information-seeking tasks. It introduces ΔBelief-RL, which uses the agent’s own belief updates as a dense intrinsic reward for intermediate actions. The turn-wise belief change is measured as $\Delta\text{Belief}_t = \log b_t - \log b_{t-1}$, where $b_t$ is the probability the agent assigns to the target solution after turn $t$. Combining this signal with the verifiable end reward enables turn-level credit assignment via turn-wise GRPO, without external reward models. On 20 Questions and across multiple model scales, ΔBelief-RL (CIA) yields stronger information-seeking performance than outcome-only training, faster belief updates, and better generalization to out-of-distribution tasks and practical applications, with performance continuing to improve as the test-time interaction budget grows. The approach improves interaction efficiency and points toward scalable, belief-calibrated learning in long-horizon, uncertain environments.
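To make the reward definition concrete, here is a minimal Python sketch of the turn-wise ΔBelief computation. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `eps` clamp, the `intrinsic_weight` parameter, and the choice to add the end reward on the final turn are all hypothetical, since this summary does not specify how the two signals are combined.

```python
import math
from typing import List


def delta_belief_rewards(beliefs: List[float], eps: float = 1e-12) -> List[float]:
    """Return log b_t - log b_{t-1} for each turn t >= 1.

    beliefs[0] is the belief in the target before any interaction;
    beliefs[t] is the belief after turn t. The eps clamp (an assumption)
    avoids log(0) when a belief collapses to zero.
    """
    logs = [math.log(max(b, eps)) for b in beliefs]
    return [logs[t] - logs[t - 1] for t in range(1, len(logs))]


def turn_rewards(
    beliefs: List[float],
    outcome_reward: float,
    intrinsic_weight: float = 1.0,
) -> List[float]:
    """Combine the dense intrinsic signal with a verifiable end reward.

    Hypothetical combination rule: scale each belief change by
    intrinsic_weight and add the outcome reward on the last turn only.
    """
    rewards = [intrinsic_weight * d for d in delta_belief_rewards(beliefs)]
    if rewards:
        rewards[-1] += outcome_reward
    return rewards


# Example: belief in the target rises, dips after an unhelpful question,
# then jumps once a decisive question is asked.
example_beliefs = [0.01, 0.05, 0.04, 0.5]
example_rewards = turn_rewards(example_beliefs, outcome_reward=1.0)
```

In this example the second turn receives a negative intrinsic reward because the belief in the target dropped, which is the kind of per-turn signal that an outcome-only reward cannot provide.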
Abstract
How can we train agents to navigate uncertainty over long horizons? In this work, we propose ΔBelief-RL, which leverages a language model's own intrinsic beliefs to reward intermediate progress. Our method uses the change in the probability an agent assigns to the target solution for credit assignment. By training on synthetic interaction data, ΔBelief-RL teaches information-seeking capabilities that consistently outperform those learned from purely outcome-based rewards in reinforcement learning, with improvements generalizing to out-of-distribution applications ranging from customer service to personalization. Notably, performance continues to improve as we scale test-time interactions beyond the training horizon, with interaction efficiency increasing even on Pass@k metrics. Overall, our work introduces a scalable training strategy for navigating uncertainty over long horizons by enabling credit assignment to intermediate actions via intrinsic ΔBelief rewards.
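One property of the log-difference reward follows directly from its definition. Assuming $b_0$ denotes the belief in the target before any interaction and $T$ the number of turns (notation introduced here for illustration), the per-turn rewards telescope:

$$\sum_{t=1}^{T} \Delta\text{Belief}_t = \sum_{t=1}^{T} \left(\log b_t - \log b_{t-1}\right) = \log b_T - \log b_0 = \log \frac{b_T}{b_0}$$

The total intrinsic reward over an episode therefore depends only on the initial and final beliefs, while the individual terms distribute that total across turns according to how much each one moved the belief.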
