Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning
Yang Liu, Yunan Luo, Yuanyi Zhong, Xi Chen, Qiang Liu, Jian Peng
TL;DR
This work tackles temporal credit assignment in episodic reinforcement learning by learning a dense, interval-based reward decomposition $\hat{r}(s_\alpha, a_\alpha)$ that sums to the episodic return $R(\tau)$. It develops a generalized policy gradient for composite rewards, augmented with a residual bias-correction term to ensure unbiased, lower-variance updates, and uses a Transformer to model forward-looking dependencies across time steps. Empirically, the method substantially improves sample efficiency on MuJoCo locomotion tasks, with Transformer-based reward predictors outperforming feed-forward (FF) and LSTM predictors and rivaling dense-reward baselines in several environments, while enabling interpretable attention patterns over time. The approach unifies sequence-modeling techniques with RL credit assignment, offering a principled way to leverage episodic signals for more efficient policy optimization.
Abstract
Recent advances in deep reinforcement learning algorithms have shown great potential and success in solving many challenging real-world problems, including the game of Go and robotic applications. Usually, these algorithms need a carefully designed reward function to guide training at each time step. However, in the real world, it is non-trivial to design such a reward function, and the only signal available is usually obtained at the end of a trajectory, also known as the episodic reward or return. In this work, we introduce a new algorithm for temporal credit assignment, which learns to decompose the episodic return back to each time step in the trajectory using deep neural networks. With this learned reward signal, the learning efficiency can be substantially improved for episodic reinforcement learning. In particular, we find that expressive language models such as the Transformer can be adopted for learning the importance and the dependency of states in the trajectory, thereby providing high-quality and interpretable learned reward signals. We have performed extensive experiments on a set of MuJoCo continuous locomotion control tasks with only episodic returns and demonstrated the effectiveness of our algorithm.
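As a rough illustration of the decomposition idea described in the abstract, the sketch below fits a per-step reward model so that its predictions along a trajectory sum to the observed episodic return. It is not the paper's method: a linear model over hand-made step features stands in for the deep network or Transformer, and the toy data and every name in the code (`TRUE_W`, `step_reward`, `fit_reward_model`, and so on) are hypothetical choices made for this example.

```python
# Toy sketch of return decomposition (assumption: a linear per-step reward
# model replaces the deep network / Transformer used in the paper).

# Hidden per-step reward weights, used only to generate episodic returns.
TRUE_W = [1.0, -0.5]

# Each trajectory is a list of per-step feature vectors phi(s_t, a_t).
trajectories = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
    [[1.0, 1.0]],
]


def step_reward(w, x):
    """Predicted reward for one step: the dot product w . phi(s_t, a_t)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predicted_return(w, traj):
    """Sum of predicted per-step rewards over a whole trajectory."""
    return sum(step_reward(w, x) for x in traj)


# The learner only observes one scalar per trajectory: the episodic return.
episodic_returns = [predicted_return(TRUE_W, traj) for traj in trajectories]


def fit_reward_model(trajs, returns, lr=0.05, steps=500):
    """Batch gradient descent on the mean squared error between the summed
    per-step predictions and the observed episodic return."""
    dim = len(trajs[0][0])
    n = len(trajs)
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for traj, ret in zip(trajs, returns):
            err = predicted_return(w, traj) - ret
            for x in traj:
                for j in range(dim):
                    grad[j] += 2.0 * err * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


w_hat = fit_reward_model(trajectories, episodic_returns)

# Dense learned reward signal: one predicted reward per time step.
dense_rewards = [[step_reward(w_hat, x) for x in traj] for traj in trajectories]
```

In this toy setting the returns are exactly representable by the linear model, so the learned per-step rewards sum back to each episodic return. With real environments the fit is only approximate, which is the gap the residual bias-correction term mentioned in the TL;DR is meant to address.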
