Towards General-Purpose Model-Free Reinforcement Learning

Scott Fujimoto, Pierluca D'Oro, Amy Zhang, Yuandong Tian, Michael Rabbat

TL;DR

This paper addresses the challenge of creating a truly general-purpose reinforcement learning algorithm. It introduces MR.Q, a model-free method that leverages model-based representations by learning state and state-action embeddings $\mathbf{z}_s$ and $\mathbf{z}_{sa}$ to induce an approximately linear relationship with the value function, while still using nonlinear value estimation $\hat{Q}(\mathbf{z}_{sa})$. The encoder, dynamics predictor, and reward/terminal losses are trained end-to-end with target networks, multi-step returns, and TD3-style value updates, enabling a single hyperparameter set to cover diverse tasks. Empirically, MR.Q achieves competitive performance across Gym Locomotion, DMC, and Atari benchmarks with faster training and fewer parameters than typical model-based methods, illustrating the potential of dynamics-informed representations for general-purpose model-free RL. The results also reveal that universality across benchmarks remains challenging, motivating further work on robust representation learning and broader evaluation.
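The multi-step returns mentioned above can be illustrated with a small sketch. It assumes the common n-step form in which rewards are discounted and summed, and the target bootstraps from a target-network value unless the episode terminates first. The function name `n_step_target`, its arguments, and the numbers used are hypothetical illustrations, not the paper's implementation or hyperparameters.

```python
def n_step_target(rewards, dones, bootstrap_value, gamma):
    """Discounted n-step return that bootstraps from a target-network value.

    Hypothetical helper: `bootstrap_value` stands in for whatever target
    value estimate the algorithm uses at the final state of the window.
    """
    target = 0.0
    discount = 1.0
    for reward, done in zip(rewards, dones):
        target += discount * reward
        if done:
            # The episode ended inside the window, so nothing is bootstrapped.
            return target
        discount *= gamma
    return target + discount * bootstrap_value


# Illustrative values only: a 3-step window with discount 0.5.
print(n_step_target([1.0, 1.0, 1.0], [False, False, False], 10.0, 0.5))  # 3.0
print(n_step_target([1.0, 1.0, 1.0], [False, True, False], 10.0, 0.5))   # 1.5
```

In the first call the target is 1 + 0.5 + 0.25 plus the bootstrapped value scaled by 0.125; in the second, termination at the second step cuts the sum off at 1.5.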

Abstract

Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice, however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods have shown impressive general results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings. To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories. We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.

Paper Structure

This paper contains 31 sections, 6 theorems, 38 equations, 6 figures, 7 tables.

Key Result

Theorem 1

The fixed point of the model-free approach (semi-gradient TD) and the solution of the model-based approach (linear model-based) are the same.
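A small numerical check can make this statement concrete. The sketch below assumes the usual linear setting: the value is a linear function of state-action features, the model-free fixed point satisfies Z^T (Z w - r - gamma Z' w) = 0, and the model-based solution is a discounted rollout through least-squares linear reward and dynamics models, w = sum_t gamma^t W_p^t w_r. These exact forms are assumptions about the referenced equations, and the toy data, the variable names, and the 2x2 helper functions are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inv2(m):
    # Inverse of a 2x2 matrix (sufficient for 2-dimensional features).
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical toy data: 4 transitions with 2-dimensional features.
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]       # current features
Z_next = [[0.0, 0.5], [0.5, 0.0], [0.3, 0.3], [0.1, 0.2]]   # next features
r = [[1.0], [0.0], [0.5], [-0.2]]                           # rewards
gamma = 0.9

Zt = transpose(Z)
A = matmul(Zt, Z)        # Z^T Z
B = matmul(Zt, Z_next)   # Z^T Z'
c = matmul(Zt, r)        # Z^T r

# Model-free: fixed point of semi-gradient TD, (Z^T Z - gamma Z^T Z') w = Z^T r.
A_minus = [[A[i][j] - gamma * B[i][j] for j in range(2)] for i in range(2)]
w_td = matmul(inv2(A_minus), c)

# Model-based: least-squares reward and dynamics models, then a discounted rollout.
A_inv = inv2(A)
w_r = matmul(A_inv, c)   # reward weights
W_p = matmul(A_inv, B)   # linear dynamics in feature space
w_mb = [[0.0], [0.0]]
term = w_r
for _ in range(200):     # truncated sum of gamma^t W_p^t w_r
    w_mb = [[w_mb[i][0] + term[i][0]] for i in range(2)]
    term = [[gamma * v[0]] for v in matmul(W_p, term)]

max_gap = max(abs(w_td[i][0] - w_mb[i][0]) for i in range(2))
print(max_gap)
```

On this toy data the rollout converges because gamma times the spectral radius of `W_p` is below one, and the two weight vectors agree up to floating-point error. The check illustrates the claim on one example; it is not a proof.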

Figures (6)

  • Figure 1: Summary of results. Aggregate mean performance across four common RL benchmarks and 118 environments featuring diverse characteristics (e.g., observation and action spaces, task types). Error bars capture a 95% stratified bootstrap confidence interval. Our algorithm, MR.Q, achieves competitive performance against both state-of-the-art domain-specific and general baselines, while using a single set of hyperparameters. Notably, MR.Q accomplishes this with fewer network parameters and substantially faster training and evaluation speeds than general-purpose model-based methods.
  • Figure 2: Aggregate learning curves. Average performance over each benchmark. Results are over 10 seeds. The shaded area captures a 95% stratified bootstrap confidence interval. Due to action repeat, 500k time steps in DMC correspond to 1M frames in the original environment and 2.5M time steps in Atari correspond to 10M frames in the original environment.
  • Figure 3: Gym - Locomotion learning curves. Results are over 10 seeds. The shaded area captures a 95% bootstrap confidence interval.
  • Figure 4: DMC - Proprioceptive learning curves. Time steps count the number of environment interactions, where 500k time steps equal 1M frames in the original environment. Results are over 10 seeds. The shaded area captures a 95% bootstrap confidence interval.
  • Figure 5: DMC - Visual learning curves. Time steps count the number of environment interactions, where 500k time steps equal 1M frames in the original environment. Results are over 10 seeds. The shaded area captures a 95% bootstrap confidence interval.
  • ...and 1 more figure

Theorems & Definitions (9)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof