Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening

Frank S. He, Yang Liu, Alexander G. Schwing, Jian Peng

TL;DR

The paper introduces optimality tightening, a constrained optimization approach added to deep Q-learning to propagate rewards more efficiently, reducing training data and time. By enforcing multi-step bounds on Q-values derived from replayed sequences, the method accelerates convergence while maintaining stability via a quadratic penalty formulation. Empirical results on 49 Atari games show substantial improvements in training speed and performance, with the method achieving strong results using only 10M frames (vs. 200M for DQN) and often outperforming baselines across many titles. The approach is compatible with other DQN enhancements and holds practical potential for faster, data-efficient deep reinforcement learning in complex environments.
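The bound-and-penalty idea summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the argument layout, and the exact form of the lower bound (best discounted k-step return plus a bootstrapped value) are assumptions made for the example, and the upper bound is simply passed in rather than derived.

```python
def lower_bound(rewards, bootstrap_values, gamma):
    """Hypothetical helper: tightest k-step lower bound for Q(s_j, a_j).

    rewards[i]          -- reward observed i steps after step j in a replayed sequence
    bootstrap_values[i] -- assumed estimate of max_a Q(s_{j+i+1}, a), e.g. from a target network
    """
    best = float("-inf")
    discounted_sum = 0.0
    for k, (r, v) in enumerate(zip(rewards, bootstrap_values)):
        discounted_sum += (gamma ** k) * r
        # k-step return plus the discounted bootstrap value after k+1 steps
        best = max(best, discounted_sum + (gamma ** (k + 1)) * v)
    return best


def penalized_loss(q, target, lower, upper, lam):
    """Squared TD error plus quadratic penalties for violating the bounds."""
    td_error = (q - target) ** 2
    below = max(lower - q, 0.0)  # how far q falls under the lower bound
    above = max(q - upper, 0.0)  # how far q rises over the upper bound
    return td_error + lam * (below ** 2 + above ** 2)


# Usage: a reward arriving two steps later already constrains Q at step j.
lb = lower_bound(rewards=[0.0, 0.0, 1.0], bootstrap_values=[0.0, 0.0, 0.0], gamma=0.5)
loss = penalized_loss(q=0.0, target=0.0, lower=lb, upper=float("inf"), lam=4.0)
```

In this toy case the one-step TD error is zero, yet the penalty term is not, which is the sense in which such bounds can propagate a delayed reward faster than one-step updates alone.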

Abstract

We propose a novel training algorithm for reinforcement learning which combines the strength of deep Q-learning with a constrained optimization approach to tighten optimality and encourage faster reward propagation. Our novel technique makes deep reinforcement learning more practical by drastically reducing the training time. We evaluate the performance of our approach on the 49 games of the challenging Arcade Learning Environment, and report significant improvements in both training time and accuracy.

Paper Structure

This paper contains 8 sections, 10 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Improvements of our method trained on 10M frames compared to results of 200M frame DQN training presented by Mnih et al. (2015), using the metric given in Eq. \ref{eq:evaluation1}.
  • Figure 2: Improvements of our method trained on 10M frames compared to results of 10M frame DQN training, using the metric given in Eq. \ref{eq:evaluation1}.
  • Figure 3: Game scores for our algorithm (blue) and DQN (black) using 10M training frames. Evaluation uses the 30 no-op protocol, and a moving average over 4 points is applied.
  • Figure : Our algorithm for fast reward propagation in reinforcement learning tasks.
  • Figure S1: Convergence of mean and median of normalized percentages on 49 games.