MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL

Claas A Voelcker, Marcel Hussing, Eric Eaton, Amir-massoud Farahmand, Igor Gilitschenski

TL;DR

This paper addresses one of the core difficulties of stable training with limited samples: the inability of learned value functions to generalize to unobserved on-policy actions. It mitigates this issue by augmenting the off-policy RL training process with a small amount of data generated from a learned world model.

Abstract

Building deep reinforcement learning (RL) agents that find a good policy with few samples has proven notoriously challenging. To achieve sample efficiency, recent work has explored updating neural networks with large numbers of gradient steps for every new sample. While such high update-to-data (UTD) ratios have shown strong empirical performance, they also introduce instability to the training process. Previous approaches need to rely on periodic neural network parameter resets to address this instability, but restarting the training process is infeasible in many real-world applications and requires tuning the resetting interval. In this paper, we focus on one of the core difficulties of stable training with limited samples: the inability of learned value functions to generalize to unobserved on-policy actions. We mitigate this issue directly by augmenting the off-policy RL training process with a small amount of data generated from a learned world model. Our method, Model-Augmented Data for TD Learning (MAD-TD), uses small amounts of generated data to stabilize high UTD training and achieve competitive performance on the most challenging tasks in the DeepMind control suite. Our experiments further highlight the importance of employing a good model to generate data, MAD-TD's ability to combat value overestimation, and its practical stability gains for continued learning.
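
The augmentation described in the abstract can be illustrated with a minimal sketch. It shows one plausible reading, not the authors' implementation: a small fraction of each replay batch is relabeled with the current policy's action, and the reward and next state for those rows come from a learned model. The names `mix_in_model_data`, `toy_policy`, `toy_model`, and the `model_fraction` value are assumptions made for illustration; the toy policy and model stand in for trained networks.

```python
import numpy as np

def mix_in_model_data(batch, policy, model, model_fraction, rng):
    """Relabel a fraction of a replay batch with model-generated on-policy transitions.

    Hypothetical helper: `policy(states)` returns actions and
    `model(states, actions)` returns (rewards, next_states).
    """
    # np.array copies its input, so the caller's batch is left untouched.
    states, actions, rewards, next_states = (np.array(x, dtype=float) for x in batch)
    n_model = int(round(model_fraction * len(states)))
    idx = rng.choice(len(states), size=n_model, replace=False)

    # Query the current policy at states that were actually visited ...
    on_policy_actions = policy(states[idx])
    # ... and let the learned model supply the outcome of those actions.
    model_rewards, model_next_states = model(states[idx], on_policy_actions)

    actions[idx] = on_policy_actions
    rewards[idx] = model_rewards
    next_states[idx] = model_next_states

    is_model = np.zeros(len(states), dtype=bool)
    is_model[idx] = True
    return (states, actions, rewards, next_states), is_model

# Toy stand-ins for a trained policy and world model (assumptions).
def toy_policy(s):
    return -0.5 * s

def toy_model(s, a):
    return -np.sum(s ** 2, axis=1), s + a

rng = np.random.default_rng(0)
real_states = rng.normal(size=(32, 3))
real_actions = rng.normal(size=(32, 3))
real_rewards = rng.normal(size=32)
real_next_states = rng.normal(size=(32, 3))
real_batch = (real_states, real_actions, real_rewards, real_next_states)

mixed, is_model = mix_in_model_data(real_batch, toy_policy, toy_model, 0.125, rng)
print(int(is_model.sum()), "of", len(is_model), "transitions are model-generated")
```

The mixed batch can then be passed to an ordinary TD update. Keeping the start states real and generating only a single model step limits how far model error can propagate, which is one reason to prefer a small `model_fraction`.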

Paper Structure

This paper contains 35 sections, 1 theorem, 19 equations, 22 figures, 2 tables.

Key Result

Proposition 1

Let $P$ be a stochastic matrix. Define the discounted state occupancy distribution $\mu$ of $P$ for some starting state distribution $\rho$ and some discount factor $\gamma \in [0,1)$ as $\mu^\top = (1-\gamma)\,\rho^\top \sum_{t=0}^{\infty} \gamma^t P^t$. Let $D$ be a diagonal matrix whose entries correspond to the discounted state occupancy distribution. Then the matrix $D(I - \gamma P)$ is positive definite.
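
Proposition 1 can be sanity-checked numerically on a small random example. The sketch below assumes the standard normalized definition $\mu^\top = (1-\gamma)\,\rho^\top (I - \gamma P)^{-1}$ and reads "positive definite" as $x^\top A x > 0$ for all nonzero $x$, which for a non-symmetric $A$ is equivalent to the symmetric part $(A + A^\top)/2$ having only positive eigenvalues. The function names and the random instance are assumptions for illustration, and a check on one instance is of course not a proof.

```python
import numpy as np

def discounted_occupancy(P, rho, gamma):
    """Return mu with mu^T = (1 - gamma) * rho^T (I - gamma P)^{-1}."""
    n = P.shape[0]
    # Solve the transposed linear system instead of forming an explicit inverse.
    return np.linalg.solve((np.eye(n) - gamma * P).T, (1.0 - gamma) * rho)

def min_symmetric_eigenvalue(P, rho, gamma):
    """Smallest eigenvalue of the symmetric part of D (I - gamma P)."""
    n = P.shape[0]
    mu = discounted_occupancy(P, rho, gamma)
    A = np.diag(mu) @ (np.eye(n) - gamma * P)
    # x^T A x equals x^T ((A + A^T) / 2) x, so only the symmetric part matters.
    return np.linalg.eigvalsh(0.5 * (A + A.T)).min()

rng = np.random.default_rng(0)
n = 6
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)   # rows sum to one: a stochastic matrix
rho = rng.random(n)
rho /= rho.sum()                    # full-support starting distribution
gamma = 0.9

mu = discounted_occupancy(P, rho, gamma)
lam_min = min_symmetric_eigenvalue(P, rho, gamma)
print("mu sums to", mu.sum())
print("smallest eigenvalue of the symmetric part:", lam_min)
```

The starting distribution here has full support, so every diagonal entry of $D$ is strictly positive; a state with zero occupancy would produce a zero row in $D(I - \gamma P)$, so the check is only meaningful when all states are reachable.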

Figures (22)

  • Figure 1: A visualization of the core issue we investigate. Even if a replay buffer contains good coverage for two policies ($\pi_\mathrm{old}$ and $\pi_\mathrm{new}$) starting from $\rho=x_0$, this does not guarantee that it contains a transition for executing an action under the new policy in a state visited under the old one. However, this state-action pair's value estimate is used to update the value of state $x_0$ via the off-policy Q update (\ref{eq:off_policy_q_update}), without being grounded in an observed transition.
  • Figure 2: Left: the train, validation, and on-policy validation error of the Q function at UTD 1. Right: the Q values and return curves of TD3 agents at UTD ratios of 1, 8, and 16.
  • Figure 3: Return curves for the dog tasks with differing UTD values. The return increases or remains stable when training with MAD-TD. Without model data, the performance decreases under high UTD. MPC is turned off in these runs to cleanly evaluate the impact of model data on critic learning.
  • Figure 4: Mean loss values with and without generated data (see \ref{fig:q_eval}) for UTD 1.
  • Figure 5: Performance comparison on the hard tasks for MAD-TD, BRO, and TD-MPC2, with varying numbers of steps and action repeat settings. MAD-TD is on par with all baselines, has higher mean and IQM when trained for 2 million time steps with action repeat 2, and strongly outperforms TD-MPC2 and BRO at 1 million time steps with action repeat 2.
  • ...and 17 more figures
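
The coverage gap described in the Figure 1 caption can be made concrete with a small tabular sketch. It assumes the common bootstrapped update $Q(x, a) \leftarrow r + \gamma\, Q(x', \pi(x'))$ and lists the pairs $(x', \pi(x'))$ that the targets query but that never occur as an observed state-action pair in the buffer. The state names, action names, and the helper `unobserved_bootstrap_pairs` are hypothetical and only illustrate the idea.

```python
def unobserved_bootstrap_pairs(buffer, policy):
    """Return pairs (x', policy[x']) used in TD targets but absent from the buffer.

    `buffer` holds (state, action, reward, next_state) tuples; `policy` maps
    non-terminal states to actions. Next states missing from `policy` are
    treated as terminal and need no bootstrap value.
    """
    observed = {(x, a) for (x, a, _, _) in buffer}
    queried = {
        (x_next, policy[x_next])
        for (_, _, _, x_next) in buffer
        if x_next in policy
    }
    return sorted(queried - observed)

# The old policy always takes action "a", the new policy always takes "b".
# Both start from x0, so the buffer covers each policy's own trajectory.
buffer = [
    ("x0", "a", 0.0, "x1"),  # collected under pi_old
    ("x1", "a", 1.0, "x3"),  # collected under pi_old, x3 is terminal
    ("x0", "b", 0.0, "x2"),  # collected under pi_new
    ("x2", "b", 1.0, "x4"),  # collected under pi_new, x4 is terminal
]
pi_new = {"x0": "b", "x1": "b", "x2": "b"}

missing = unobserved_bootstrap_pairs(buffer, pi_new)
print(missing)  # → [('x1', 'b')]
```

Updating $Q(x_0, a)$ under the new policy bootstraps from $Q(x_1, b)$, yet no transition for action `b` in state `x1` was ever collected, so that value is not grounded in any observed data.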

Theorems & Definitions (2)

  • Proposition 1
  • proof