Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning

Yang Liu, Yunan Luo, Yuanyi Zhong, Xi Chen, Qiang Liu, Jian Peng

TL;DR

This work tackles temporal credit assignment in episodic reinforcement learning by learning a dense, interval-based reward decomposition $\hat{r}(s_\alpha, a_\alpha)$ whose terms sum to the episodic return $R(\tau)$. It develops a generalized policy gradient for composite rewards, augmented with a residual bias-correction term to ensure unbiased, lower-variance updates, and uses a Transformer to model forward-looking dependencies across time steps. Empirically, the method substantially improves sample efficiency on MuJoCo locomotion tasks, with Transformer-based reward predictors outperforming feed-forward (FF) and LSTM predictors and rivaling dense-reward baselines in several environments, while enabling interpretable attention patterns over time. The approach unifies sequence-modeling techniques with RL credit assignment, offering a principled way to leverage episodic signals for more efficient policy optimization.

Abstract

Recent advances in deep reinforcement learning algorithms have shown great potential and success in solving many challenging real-world problems, including the game of Go and robotic applications. Usually, these algorithms need a carefully designed reward function to guide training at each time step. However, in the real world, it is non-trivial to design such a reward function, and the only signal available is usually obtained at the end of a trajectory, also known as the episodic reward or return. In this work, we introduce a new algorithm for temporal credit assignment, which learns to decompose the episodic return back to each time step in the trajectory using deep neural networks. With this learned reward signal, the learning efficiency can be substantially improved for episodic reinforcement learning. In particular, we find that expressive language models such as the Transformer can be adopted for learning the importance and the dependency of states in the trajectory, thereby providing high-quality and interpretable learned reward signals. We have performed extensive experiments on a set of MuJoCo continuous locomotion control tasks with only episodic returns and demonstrated the effectiveness of our algorithm.
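The idea of decomposing an episodic return into per-step rewards can be illustrated with a toy regression. This is a minimal sketch under stated assumptions, not the authors' implementation: a linear predictor over hypothetical per-step feature vectors stands in for the neural reward predictor, the data are synthetic, and the names `predict_rewards`, `regression_loss`, `sgd_step`, and `make_toy_data` are invented for illustration. The only supervision used is that the predicted per-step rewards should sum to the episodic return.

```python
import random

def predict_rewards(w, features):
    """Per-step reward r_hat_t = w . phi_t for each step's feature vector phi_t."""
    return [sum(wi * xi for wi, xi in zip(w, phi)) for phi in features]

def regression_loss(w, trajectories):
    """Mean squared gap between the summed predicted rewards and the episodic return."""
    total = 0.0
    for features, episodic_return in trajectories:
        total += (sum(predict_rewards(w, features)) - episodic_return) ** 2
    return total / len(trajectories)

def sgd_step(w, trajectories, lr):
    """One full-batch gradient step on the regression loss."""
    grad = [0.0] * len(w)
    for features, episodic_return in trajectories:
        err = sum(predict_rewards(w, features)) - episodic_return
        # d/dw (sum_t w . phi_t - R)^2 = 2 * err * sum_t phi_t
        for phi in features:
            for i, xi in enumerate(phi):
                grad[i] += 2.0 * err * xi
    n = len(trajectories)
    return [wi - lr * gi / n for wi, gi in zip(w, grad)]

def make_toy_data(true_w, num_traj=32, horizon=10, seed=0):
    """Synthetic trajectories (assumption): only the episodic return is observed."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_traj):
        features = [[rng.uniform(-1.0, 1.0) for _ in true_w] for _ in range(horizon)]
        episodic_return = sum(predict_rewards(true_w, features))
        data.append((features, episodic_return))
    return data

data = make_toy_data(true_w=[1.0, -2.0, 0.5])
w = [0.0, 0.0, 0.0]
initial_loss = regression_loss(w, data)
for _ in range(500):
    w = sgd_step(w, data, lr=0.01)
final_loss = regression_loss(w, data)
print(initial_loss, final_loss)
```

After fitting, `predict_rewards(w, features)` returns one reward per time step, which is the kind of dense signal a policy optimizer could consume in place of the single end-of-episode return.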

Paper Structure

This paper contains 21 sections, 1 theorem, 15 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

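I) Denote by $\hat{J}(\theta) := \mathbb{E}_{\pi_\theta}[\hat{R}(\tau)]$ the expectation of the composite reward $\hat{R}(\tau)$. Then

$$\nabla_\theta \hat{J}(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{\alpha} \hat{r}(s_\alpha, a_\alpha) \sum_{t \in \Gamma_\alpha} \nabla_\theta \log \pi(a_t | s_t)\Big],$$

where the outer sum runs over the index sets $\alpha$ of the reward decomposition, $\Gamma_\alpha = \{ t \colon t \leq \max(\alpha)\}$, and $\max(\alpha)$ denotes the maximum element of the set $\alpha$. Note that $\Gamma_\alpha$ is the set of all $t$ for which $\nabla_\theta \log \pi(a_t | s_t)$ is multiplied by $\hat{r}(s_\alpha, a_\alpha)$.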
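The role of $\Gamma_\alpha$ can be checked on a small example. The sketch below assumes 0-indexed time steps, hypothetical reward values keyed by index tuples $\alpha$, and made-up score vectors standing in for $\nabla_\theta \log \pi(a_t | s_t)$; the function names are illustrative. It computes the single-trajectory quantity in two ways: by pairing each reward term with every score in $\Gamma_\alpha$, and by first collecting, for each $t$, the total reward of all $\alpha$ with $\max(\alpha) \geq t$.

```python
def gamma(alpha):
    """Gamma_alpha = {t : t <= max(alpha)}, with time steps indexed from 0."""
    return range(max(alpha) + 1)

def per_alpha_estimate(rewards, scores):
    """Sum over alpha of r_hat_alpha times the sum of scores over Gamma_alpha."""
    dim = len(scores[0])
    grad = [0.0] * dim
    for alpha, r_hat in rewards.items():
        for t in gamma(alpha):
            for i in range(dim):
                grad[i] += r_hat * scores[t][i]
    return grad

def per_step_weights(rewards, horizon):
    """Weight on the score at step t: total reward of every alpha with max(alpha) >= t."""
    return [
        sum(r_hat for alpha, r_hat in rewards.items() if max(alpha) >= t)
        for t in range(horizon)
    ]

def per_step_estimate(rewards, scores):
    """Same quantity, reorganized as a weighted sum over time steps."""
    dim = len(scores[0])
    weights = per_step_weights(rewards, len(scores))
    return [sum(weights[t] * scores[t][i] for t in range(len(scores))) for i in range(dim)]

# Hypothetical values: score vectors for a 4-step trajectory, two parameters.
scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
# Hypothetical composite reward terms, keyed by their index sets alpha.
rewards = {(0,): 0.5, (1, 2): 1.0, (0, 3): -0.25, (2,): 2.0}

weights = per_step_weights(rewards, len(scores))
grad_a = per_alpha_estimate(rewards, scores)
grad_b = per_step_estimate(rewards, scores)
print(weights)  # [3.25, 2.75, 2.75, -0.25]
print(grad_a)   # [5.5, 5.75]
print(grad_b)   # [5.5, 5.75]
```

Both forms agree because a reward term indexed by $\alpha$ reaches the score at step $t$ exactly when $t \leq \max(\alpha)$. The per-step form shows that each action is credited only with reward terms whose latest index is not earlier than that action.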

Figures (7)

  • Figure 1: Learning curves for PPO baselines and our proposed method on tasks with episodic rewards. The mean and standard deviation over 5 random seeds are plotted. The x- and y-axes represent the number of training samples (in millions) and the average return, respectively.
  • Figure 2: Ablation analysis on choices of (a) network structures for the reward function; (b) strategies for data collection; (c)-(d) methods with or without bias correction under different learning rates (lr). The Walker2d environment was used to perform the analyses. The mean and standard deviation over 5 random seeds are plotted.
  • Figure A1: Overview of our approach. Rollout trajectories are generated by interacting with the environment. A reward predictor is trained on the collected trajectories and episodic returns with the regression loss. The predicted rewards are then used for policy optimization.
  • Figure A2: Network structures for the reward predictor: (a) feed-forward network, (b) LSTM network, (c) Transformer network.
  • Figure A3: Comparison between different buffer updating methods. The x-axis denotes the number of training samples and the y-axis denotes the average episodic return. The red curve represents the training curve using the episodic return. The yellow, blue, and green curves represent the algorithm with the online, historical-online, and stratified-sampling buffer schemes, respectively.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Theorem 1