Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

Yun Qu, Yuhang Jiang, Boyuan Wang, Yixiu Mao, Cheems Wang, Chang Liu, Xiangyang Ji

TL;DR

Latent Reward (LaRe) addresses credit assignment in episodic reinforcement learning with sparse, delayed feedback by introducing a multifaceted latent reward $z_r$ and an LLM-empowered encoding function $\phi: \mathcal{S} \times \mathcal{A} \to \mathcal{D}$, whose output is decoded by $f: \mathcal{D} \to \mathbb{R}$ to produce proxy rewards. LaRe integrates task-related priors from large language models through Environment Prompting and a self-verification loop to generate executable latent-reward encoders, then uses a reward decoder to improve temporal credit assignment. The approach yields a tighter concentration bound $\| r-\hat{r}^\phi_k \|_{A^\phi_k} \le l^\phi_k$ and a regret bound $\rho^\phi(K) \le O(T \|\mathcal{D}\| \sqrt{K})$, enabling more accurate reward modeling and faster learning. Empirically, LaRe achieves superior temporal credit assignment on MuJoCo and MPE benchmarks, enhances multi-agent credit allocation, and, in some tasks, even matches or surpasses policies trained with ground-truth dense rewards, highlighting the value of semantically interpretable latent rewards guided by LLM priors.
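The encode-then-decode pipeline summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the hand-written `phi` stands in for an LLM-generated encoder, the three latent factors are invented, and the decoder is a linear ridge regression rather than the paper's learned reward decoder. The sketch only shows how per-step proxy rewards can be recovered when the episodic return is the sole training signal.

```python
import numpy as np

def phi(state, action):
    """Hypothetical latent-reward encoder (stand-in for LLM-generated code)."""
    return np.array([
        state[0],                     # assumed factor: forward progress
        -float(np.sum(action ** 2)),  # assumed factor: control-effort penalty
        -abs(state[1]),               # assumed factor: deviation from upright
    ])

def encode_episode(states, actions):
    """Stack z_{r,t} = phi(s_t, a_t) for every step into a (T, d) array."""
    return np.stack([phi(s, a) for s, a in zip(states, actions)])

def fit_linear_decoder(latents_per_episode, returns, lam=1e-3):
    """Fit f(z) = w @ z so summed proxy rewards match episodic returns.

    This is plain ridge regression on per-episode latent sums, an assumed
    simplification of a learned decoder.
    """
    H = np.stack([z.sum(axis=0) for z in latents_per_episode])  # (K, d)
    A = H.T @ H + lam * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ np.asarray(returns))

def proxy_rewards(latents, w):
    """Per-step proxy rewards for one episode, shape (T,)."""
    return latents @ w

# Synthetic data: the true reward is linear in the latent factors, but the
# learner only observes one return per episode.
rng = np.random.default_rng(0)
true_w = np.array([1.0, 0.5, 2.0])
episodes, returns = [], []
for _ in range(50):
    states = rng.normal(size=(20, 4))
    actions = rng.normal(size=(20, 2))
    z = encode_episode(states, actions)
    episodes.append(z)
    returns.append(float(z.sum(axis=0) @ true_w))

w = fit_linear_decoder(episodes, returns)
print("recovered decoder weights:", w)
```

Because the latent space here has only three dimensions, fifty episodic returns are enough to pin down the decoder, which is the intuition behind removing reward-irrelevant redundancy before regressing on returns.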

Abstract

Reinforcement learning (RL) often encounters delayed and sparse feedback in real-world applications, sometimes with only episodic rewards available. Previous approaches have made some progress in reward redistribution for credit assignment but still face challenges, including training difficulties due to redundancy and ambiguous attributions that stem from overlooking the multifaceted nature of mission performance evaluation. Large Language Models (LLMs) encode rich decision-making knowledge and offer a plausible tool for reward redistribution. Even so, deploying an LLM in this setting is non-trivial because of the misalignment between linguistic knowledge and the required symbolic form, together with the inherent randomness and hallucinations of LLM inference. To tackle these issues, we introduce LaRe, a novel LLM-empowered, symbol-based decision-making framework that improves credit assignment. Key to LaRe is the concept of the Latent Reward, which works as a multi-dimensional performance evaluation, enabling more interpretable goal attainment from various perspectives and facilitating more effective reward redistribution. We find that code generated by an LLM can bridge linguistic knowledge and symbolic latent rewards, since it is executable on symbolic objects. Meanwhile, we design latent reward self-verification to increase the stability and reliability of LLM inference. Theoretically, eliminating reward-irrelevant redundancy in the latent reward benefits RL performance through more accurate reward estimation. Extensive experimental results show that LaRe (i) achieves superior temporal credit assignment compared with SOTA methods, (ii) excels in allocating contributions among multiple agents, and (iii) outperforms policies trained with ground-truth rewards on certain tasks.

Paper Structure

This paper contains 38 sections, 6 theorems, 21 equations, 15 figures, 4 tables, and 1 algorithm.

Key Result

Proposition 0

Let $\lambda>0$ and $A^\phi_{k}\stackrel{\text{def}}{=} (H^\phi_k)^TH^\phi_k +\lambda I_{\|\mathcal{D}\|}$. For any $\delta\in(0,1)$, with probability greater than $1-\delta/10$, uniformly for all episode indices $k\ge0$, it holds that $\| r-\hat{r}^\phi_k \|_{A^\phi_k} \le l^\phi_k$.
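To make the notation concrete, the sketch below builds a regularized Gram matrix of the form $H^TH + \lambda I$ and evaluates the matrix-weighted norm $\|x\|_A = \sqrt{x^T A x}$ in which the concentration bound is stated. The matrix `H` is a random placeholder with one row per episode and one column per latent dimension; its shape and contents, and the helper names, are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def regularized_gram(H, lam):
    """Return A = H^T H + lam * I, which is positive definite for lam > 0."""
    return H.T @ H + lam * np.eye(H.shape[1])

def weighted_norm(x, A):
    """Return the A-weighted norm ||x||_A = sqrt(x^T A x)."""
    return float(np.sqrt(x @ A @ x))

rng = np.random.default_rng(0)
lam = 1.0
H = rng.integers(0, 5, size=(8, 3)).astype(float)  # placeholder episode features
A = regularized_gram(H, lam)
x = np.array([1.0, -2.0, 0.5])                      # placeholder error vector

# Identity worth noting: ||x||_A^2 = ||H x||^2 + lam * ||x||^2
lhs = weighted_norm(x, A) ** 2
rhs = float(np.sum((H @ x) ** 2) + lam * np.sum(x ** 2))
print(lhs, rhs)
```

The identity in the last lines shows why the dimension of $A$ matters: the norm measures error along the directions the data has covered plus a $\lambda$-weighted penalty in every coordinate, so a lower-dimensional latent space leaves fewer coordinates to control.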

Figures (15)

  • Figure 1: Overview of LaRe. (a) The probabilistic model of the episodic reward with the latent reward $z_{r, t}$ introduced as the implicit variable. (b) The LaRe framework consists of three main components: (1) Environment Prompting: the task information is incorporated into a standardized prompt for LLM instructions (details are in Appendix A). (2) Latent Reward Self-verification: during the self-prompting phase, the LLM generates $n$ candidate responses $\{\xi_i\}_{i=1}^n$ and synthesizes an improved response $\xi$. In the pre-verification phase, the executability of the function $\phi$ is verified with pre-collected random states $\bar{s}$. (3) Contribution Allocation: latent rewards $z_{r, t}$ are derived by $\phi$ and used to estimate proxy rewards via the reward decoder model $f_\psi$.
  • Figure 2: Average episode return for tasks with different state space dimensions in MuJoCo. Notably, TD3-DR is trained with dense rewards.
  • Figure 3: Average episode return for tasks with a varied number of agents in MPE. Notably, IPPO-DR is trained with dense rewards and LaRe w/o AD represents LaRe without credit assignment among agents.
  • Figure 4: (a) The task HumanoidStandup-v4 aims to make the humanoid stand up and maintain balance. (b) LLM-generated latent rewards additionally consider implicit factors affecting stability compared to the ground truth rewards. (c) Comparison between LaRe and RD on the competitive Predator-Prey (6 agents) task. 'X vs Y' represents the condition where X controls the prey and Y controls the predators. LaRe outperforms RD when directly pitted against it.
  • Figure 5: Ablation studies of the reward model and the proposed self-verified LLM generation, as well as comparisons of LaRe with the variational information bottleneck.
  • ...and 10 more figures
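The pre-verification phase described in the Figure 1 caption, where a generated encoder is checked for executability on pre-collected random states, might look roughly like the sketch below. The function name `latent_reward`, the candidate source strings, and the specific checks (finite output, fixed latent dimension) are hypothetical choices for illustration; the source does not specify them.

```python
import numpy as np

def pre_verify(source, states, actions, fn_name="latent_reward"):
    """Run a candidate encoder on sample inputs and report (ok, message).

    The candidate passes if it executes on every sample and always returns
    a finite 1-D vector of the same length. Note that exec runs arbitrary
    code, so real use would need sandboxing.
    """
    namespace = {"np": np}
    try:
        exec(source, namespace)           # define the candidate function
        fn = namespace[fn_name]
        dims = set()
        for s, a in zip(states, actions):
            z = np.asarray(fn(s, a), dtype=float)
            if z.ndim != 1 or not np.all(np.isfinite(z)):
                return False, "output must be a finite 1-D vector"
            dims.add(z.shape[0])
        if len(dims) != 1:
            return False, "latent dimension varies across inputs"
    except Exception as exc:              # any runtime failure rejects the candidate
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"

# Pre-collected random samples (assumed shapes: 4-D states, 2-D actions).
rng = np.random.default_rng(0)
states = rng.normal(size=(16, 4))
actions = rng.normal(size=(16, 2))

good = (
    "def latent_reward(state, action):\n"
    "    return [state[0], -float(np.sum(action ** 2))]\n"
)
bad = (
    "def latent_reward(state, action):\n"
    "    return [state[99]]\n"            # indexes past the state dimension
)

good_result = pre_verify(good, states, actions)
bad_result = pre_verify(bad, states, actions)
print(good_result, bad_result)
```

Returning the error message alongside the pass/fail flag keeps the rejection reason available, for example for logging or for a later regeneration attempt.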

Theorems & Definitions (8)

  • Proposition 0: Tighter Concentration Bound of Reward
  • Proposition 0: Tighter Regret Bound
  • Theorem 1: abbasi2011improved, Theorem 2
  • Proposition 1: Tighter Concentration Bound of Reward
  • proof
  • Lemma 1: efroni2021reinforcement, Lemma 8
  • Proposition 1: Tighter Regret Bound
  • proof