Recall Traces: Backtracking Models for Efficient Reinforcement Learning

Anirudh Goyal, Philemon Brakel, William Fedus, Soumye Singhal, Timothy Lillicrap, Sergey Levine, Hugo Larochelle, Yoshua Bengio

TL;DR

The paper introduces a backtracking model that learns backward transitions from high-value states and uses them to generate Recall Traces: alternative trajectories that end in valuable outcomes and that the policy can imitate. Framed variationally, the method targets sparse-reward settings by concentrating training on experience near high-reward regions, and it improves sample efficiency for both on-policy and off-policy RL methods. Empirical results across several tasks show faster learning and better exploration than experience replay baselines such as PER, and show further benefits when GoalGAN-generated goals are used as high-value seed states. The approach is straightforward to combine with standard RL algorithms.

Abstract

In many environments only a tiny subset of all states yields high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. To this end, we advocate for the use of a backtracking model that predicts the preceding states of trajectories that terminate at a given high-reward state. We can train a model which, starting from a high-value state (or one that is estimated to have high value), predicts and samples which (state, action)-tuples may have led to that high-value state. These traces of (state, action) pairs, which we refer to as Recall Traces, are sampled from this backtracking model starting from a high-value state. They are informative because they terminate in good states, and hence we can use them to improve a policy. We provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories which lead to large rewards. Our method improves the sample efficiency of both on- and off-policy RL algorithms across several environments and tasks.
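
To make the idea of sampling backward from a high-value state concrete, here is a minimal tabular sketch. It is not the paper's implementation: the paper describes a learned backtracking model with a variational interpretation, whereas this toy replaces it with simple transition counts. The 1-D chain environment, the random data-collection policy, and all names (`step`, `collect_transitions`, `fit_backtracking_model`, `sample_recall_trace`) are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict

N_STATES = 8        # hypothetical 1-D chain: states 0..7, state 7 is "high value"
ACTIONS = (-1, +1)  # move left or right


def step(state, action):
    """Deterministic chain dynamics, clipped at both ends."""
    return min(max(state + action, 0), N_STATES - 1)


def collect_transitions(n_steps, rng):
    """Roll out a uniformly random policy; return (s, a, s_next) tuples."""
    data, state = [], 0
    for _ in range(n_steps):
        action = rng.choice(ACTIONS)
        nxt = step(state, action)
        data.append((state, action, nxt))
        state = nxt
    return data


def fit_backtracking_model(transitions):
    """Count-based stand-in for a backward model p(s_prev, a | s_next)."""
    model = defaultdict(Counter)
    for s, a, s_next in transitions:
        if s != s_next:  # ignore no-op moves against a wall
            model[s_next][(s, a)] += 1
    return model


def sample_recall_trace(model, high_value_state, length, rng):
    """Walk backward from a high-value state.

    Returns (state, action) pairs in forward order, so that replaying the
    actions from the first state ends at `high_value_state`.
    """
    trace, state = [], high_value_state
    for _ in range(length):
        predecessors = model.get(state)
        if not predecessors:
            break  # no known way to reach this state
        pairs = list(predecessors.keys())
        weights = list(predecessors.values())
        prev, action = rng.choices(pairs, weights=weights, k=1)[0]
        trace.append((prev, action))
        state = prev
    return trace[::-1]


if __name__ == "__main__":
    rng = random.Random(0)
    model = fit_backtracking_model(collect_transitions(5000, rng))
    print(sample_recall_trace(model, N_STATES - 1, 5, rng))
```

Each sampled trace is a valid trajectory that terminates in the chosen high-value state, which is the property the abstract relies on; the (state, action) pairs could then serve as imitation targets for a policy. Because this sketch fits the backward model on random-policy data, its traces are not necessarily short or efficient routes to that state.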

Paper Structure

This paper contains 27 sections, 5 equations, 19 figures, 3 tables, 2 algorithms.

Figures (19)

  • Figure 1: The policy explores the state space $\mathcal{S}$ from an initial state. Discovered high-value states are then passed to the backtracking model (dashed lines) to generate new traces that may have led to these high-value states.
  • Figure 2: Training curves from the Four Room Environment for the Actor-Critic baseline (blue) and the backtracking-model-augmented Actor-Critic (orange). For the size-19 environment, several of the Actor-Critic baselines failed to converge, whereas the model augmented with recall traces always succeeded within the number of training steps considered. For additional results, see the full Four Room results figure in the Appendix.
  • Figure 3: Visitation count visualization of trained policies for PER (left) and Recall Traces (right) for two 4-room grid sizes.
  • Figure 4: Plots of reward vs. time steps, comparing the performance of recall traces (labeled BacktrackingModel), PER, and the baseline Actor-Critic (AC).
  • Figure 5: Visualization of GoalGAN baseline (b) vs. backtracking model (c) policy performance for different parts of the state space in the Ant Maze task. Red indicates complete success; blue indicates failure. The backtracking model achieves equal coverage rates in fewer training steps.
  • ...and 14 more figures