Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models

Wen-Tse Chen, Jiayu Chen, Fahim Tajwar, Hao Zhu, Xintong Duan, Ruslan Salakhutdinov, Jeff Schneider

TL;DR

The paper tackles sparse environmental feedback in sequential decision-making by introducing Retrospective In-Context Learning (RICL), which uses pretrained LLM knowledge to estimate the advantage function via retrospective in-context updates. Building on this, Retrospective In-Context Online Learning (RICOL) provides an online framework that combines RICL-based credit assignment with advantage-weighted regression and a KL-regularized policy update to achieve sample-efficient, multi-turn RL with LLMs. Empirical results in a 1D Key-Door task and four BabyAI scenarios show that RICL yields accurate, low-variance credit signals with very few samples, while RICOL outperforms strong baselines in sample efficiency and robustness to noisy feedback. Together, the approach demonstrates the potential of leveraging LLMs for temporal credit assignment, enabling more generalizable and data-efficient RL paradigms for language-conditioned tasks.

Abstract

Learning from self-sampled data and sparse environmental feedback remains a fundamental challenge in training self-evolving agents. Temporal credit assignment mitigates this issue by transforming sparse feedback into dense supervision signals. However, previous approaches typically depend on learning task-specific value functions for credit assignment, which suffer from poor sample efficiency and limited generalization. In this work, we propose to leverage pretrained knowledge from large language models (LLMs) to transform sparse rewards into dense training signals (i.e., the advantage function) through retrospective in-context learning (RICL). We further propose an online learning framework, RICOL, which iteratively refines the policy based on the credit assignment results from RICL. We empirically demonstrate that RICL can accurately estimate the advantage function with limited samples and effectively identify critical states in the environment for temporal credit assignment. Extended evaluation on four BabyAI scenarios shows that RICOL achieves convergent performance comparable to that of traditional online RL algorithms with significantly higher sample efficiency. Our findings highlight the potential of leveraging LLMs for temporal credit assignment, paving the way for more sample-efficient and generalizable RL paradigms.
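
The TL;DR mentions advantage-weighted regression with a KL-regularized policy update. The sketch below illustrates the standard closed form of that update for a single state, not the paper's implementation: maximizing expected advantage minus `beta` times the KL divergence to a reference policy gives a new policy proportional to `pi0 * exp(A / beta)`. All names (`kl_regularized_update`, `awr_weights`, `max_weight`) and the toy numbers are assumptions chosen for illustration.

```python
import math

def kl_regularized_update(pi0, advantages, beta):
    """Closed-form maximizer of E_pi[A] - beta * KL(pi || pi0) for one state.

    pi0 and advantages are lists indexed by action (hypothetical toy inputs).
    """
    unnorm = [p * math.exp(a / beta) for p, a in zip(pi0, advantages)]
    z = sum(unnorm)  # state-dependent normalizer
    return [u / z for u in unnorm]

def awr_weights(advantages, beta, max_weight=20.0):
    """Per-sample regression weights exp(A / beta), clipped for stability.

    The clipping threshold is an assumption, not a value from the paper.
    """
    return [min(math.exp(a / beta), max_weight) for a in advantages]

# Toy reference policy over three actions and advantages that average
# to zero under pi0 (0.5*0.2 - 0.3*0.2 - 0.2*0.2 = 0).
pi0 = [0.5, 0.3, 0.2]
adv = [0.2, -0.2, -0.2]
beta = 0.5

pi_new = kl_regularized_update(pi0, adv, beta)
weights = awr_weights(adv, beta)

# Conversely, the scaled log-ratio of the two policies recovers the
# advantage up to a constant shared by all actions in the state.
recovered = [beta * math.log(q / p) for q, p in zip(pi_new, pi0)]
offsets = [r - a for r, a in zip(recovered, adv)]
```

In this toy case every entry of `offsets` equals the same constant, `-beta * log(z)`, which is why a log-ratio between an updated policy and a reference policy can serve as a dense, per-action credit signal.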

Paper Structure (29 sections, 1 theorem, 16 equations, 14 figures, 3 tables, 1 algorithm)

Key Result

Theorem 4.1

Let $\pi_0: \mathcal{S} \times \mathcal{A} \rightarrow (0, 1)$ and $\pi': \mathcal{S} \times \mathcal{A} \rightarrow (0, 1)$ be any two policies in an MDP with transition kernel $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ and discount factor $\gamma \in [0,1]$. Then, th… where $\beta > 0$ is a known scaling parameter, and $A_{r}^{\pi_0}(s, a)$ denotes the advantage fun…

Figures (14)

  • Figure 1: The pipeline of retrospective in-context online learning, where step ② and step ③ represent retrospective in-context learning.
  • Figure 2: Comparison of error in advantage function estimation. The x-axis represents the number of trajectories used for estimation, while the y-axis shows the mean error between the estimated and ground-truth advantages. RICL achieves significantly lower error with fewer samples (around 10) compared to Monte Carlo, which requires about 1000 samples for similar accuracy. Additionally, the error of RICL is more stable across trials (lower variance).
  • Figure 3: Accuracy comparison of ICL and RICL on predicting expert actions in the BabyAI goto scenario across 1000 trajectories. The Base bar shows the zero-shot performance of LLaMA-3.1-8B-Instruct. RICL outperforms ICL by 7.2%, demonstrating the effectiveness of retrospective updates.
  • Figure 4: Comparison of our method (RICOL) against four baseline algorithms across four BabyAI scenarios. RICOL consistently demonstrates superior sample efficiency, achieving strong performance with significantly fewer interactions. Notably, RICOL outperforms both PPO (10M) and PPO (3B) while requiring over $50 \times$ and $10 \times$ fewer environment steps, respectively. Compared to Reflexion, an in-context learning method using trajectory-level verbal feedback, RICOL exhibits better convergent performance by leveraging temporal credit assignment (from RICL) and state-specific feedback. Additionally, RICOL surpasses GPT-4o mini, despite using a smaller policy model (LLaMA-3.2-3B-Instruct), underscoring the importance of interactive learning from the environment. As a useful trick to boost performance, after RICOL completes its predefined training schedule in the first stage, we run a second training stage that uses the real environment rewards as the advantage and applies advantage-weighted regression.
  • Figure 5: RICOL employs in-context credit assignment to generate dense learning signals, enabling more sample-efficient policy training. In contrast, RWR lacks credit assignment, depends on strong base policies with high initial success rates, and performs poorly on tasks with sparse rewards.
  • ...and 9 more figures
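
Figure 2 uses Monte Carlo estimation as the reference point for advantage error. As a rough illustration of what such a baseline computes, the sketch below averages returns-to-go per state-action pair and per state, then takes their difference. The state and action names, the toy trajectories, and the function names are all hypothetical; this is not the paper's Key-Door environment or evaluation code.

```python
from collections import defaultdict

def returns_to_go(rewards, gamma):
    """Discounted return from each step to the end of the trajectory."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def mc_advantages(trajectories, gamma=1.0):
    """Every-visit Monte Carlo estimate of A(s, a) = Q(s, a) - V(s).

    Each trajectory is a list of (state, action, reward) tuples.
    """
    q_sum, q_cnt = defaultdict(float), defaultdict(int)
    v_sum, v_cnt = defaultdict(float), defaultdict(int)
    for traj in trajectories:
        rewards = [r for _, _, r in traj]
        for (s, a, _), g in zip(traj, returns_to_go(rewards, gamma)):
            q_sum[(s, a)] += g
            q_cnt[(s, a)] += 1
            v_sum[s] += g
            v_cnt[s] += 1
    return {
        (s, a): q_sum[(s, a)] / q_cnt[(s, a)] - v_sum[s] / v_cnt[s]
        for (s, a) in q_sum
    }

# Hypothetical sparse-reward data: reward 1 only when the door is opened.
trajectories = [
    [("start", "pick_key", 0.0), ("has_key", "open_door", 1.0)],
    [("start", "pick_key", 0.0), ("has_key", "wait", 0.0)],
    [("start", "wait", 0.0)],
    [("start", "pick_key", 0.0), ("has_key", "open_door", 1.0)],
]

adv = mc_advantages(trajectories)
for key in sorted(adv):
    print(key, round(adv[key], 3))
```

With only four trajectories each estimate rests on a handful of returns, which gives some intuition for why a pure Monte Carlo estimator needs many samples before its error and variance settle.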

Theorems & Definitions (2)

  • Theorem 4.1
  • proof