Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models
Wen-Tse Chen, Jiayu Chen, Fahim Tajwar, Hao Zhu, Xintong Duan, Ruslan Salakhutdinov, Jeff Schneider
TL;DR
The paper tackles sparse environmental feedback in sequential decision-making by introducing Retrospective In-Context Learning (RICL), which uses pretrained LLM knowledge to estimate the advantage function via retrospective in-context updates. Building on this, Retrospective In-Context Online Learning (RICOL) provides an online framework that combines RICL-based credit assignment with advantage-weighted regression and a KL-regularized policy update to achieve sample-efficient, multi-turn RL with LLMs. Empirical results in a 1D Key-Door task and four BabyAI scenarios show that RICL yields accurate, low-variance credit signals with very few samples, while RICOL outperforms strong baselines in sample efficiency and robustness to noisy feedback. Together, the approach demonstrates the potential of leveraging LLMs for temporal credit assignment, enabling more generalizable and data-efficient RL paradigms for language-conditioned tasks.
Abstract
Learning from self-sampled data and sparse environmental feedback remains a fundamental challenge in training self-evolving agents. Temporal credit assignment mitigates this issue by transforming sparse feedback into dense supervision signals. However, previous approaches typically depend on learning task-specific value functions for credit assignment, which suffer from poor sample efficiency and limited generalization. In this work, we propose to leverage pretrained knowledge from large language models (LLMs) to transform sparse rewards into dense training signals (i.e., the advantage function) through retrospective in-context learning (RICL). We further propose an online learning framework, RICOL, which iteratively refines the policy based on the credit assignment results from RICL. We empirically demonstrate that RICL can accurately estimate the advantage function with limited samples and effectively identify critical states in the environment for temporal credit assignment. Extended evaluations on four BabyAI scenarios show that RICOL converges to performance comparable to that of traditional online RL algorithms while being significantly more sample-efficient. Our findings highlight the potential of leveraging LLMs for temporal credit assignment, paving the way for more sample-efficient and generalizable RL paradigms.
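To make the advantage-weighted, KL-regularized policy update mentioned in the TL;DR concrete, the following is a minimal tabular sketch, not the authors' implementation. It uses the standard closed-form solution of maximizing expected advantage minus beta times the KL divergence from the current policy, namely new_policy(a) proportional to policy(a) * exp(A(a) / beta). The function name `kl_regularized_update`, the action names, and the advantage values are hypothetical and chosen only for illustration. In RICOL the advantages would come from RICL, and the policy is an LLM rather than a lookup table.

```python
import math


def kl_regularized_update(policy, advantages, beta=1.0):
    """Reweight one state's action probabilities by exp(advantage / beta).

    policy:     dict mapping action -> probability under the current policy.
    advantages: dict mapping action -> estimated advantage (assumed given).
    beta:       temperature; larger values keep the result closer to `policy`.
    """
    # Subtract the max advantage before exponentiating for numerical stability;
    # the shift cancels out in the normalization below.
    max_adv = max(advantages[a] for a in policy)
    unnormalized = {
        a: p * math.exp((advantages[a] - max_adv) / beta)
        for a, p in policy.items()
    }
    total = sum(unnormalized.values())
    return {a: w / total for a, w in unnormalized.items()}


# Hypothetical state with three actions and made-up advantage estimates.
old_policy = {"left": 1 / 3, "right": 1 / 3, "pickup": 1 / 3}
advantages = {"left": -0.5, "right": 0.0, "pickup": 1.0}
new_policy = kl_regularized_update(old_policy, advantages, beta=1.0)
```

Actions with higher estimated advantage gain probability mass, and raising `beta` shrinks the step away from the old policy. With all advantages equal, the policy is left unchanged.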
