Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics

Jashaswimalya Acharjee, Balaraman Ravindran

TL;DR

ULD proposes a unified latent-dynamics RL framework that achieves model-based representational benefits within a model-free execution paradigm. By learning state-action embeddings that render $Q^\pi(s,a)$ approximately linear, ULD attains cross-domain effectiveness with a single hyperparameter setup across continuous and visual tasks as well as discrete Atari games. Theoretical results show fixed-point equivalence between model-free TD updates and model-based value expansions, plus a bound tying value error to embedding and dynamics accuracy. Empirically, ULD matches or surpasses specialized baselines across 80 tasks with reduced computational overhead, suggesting that robust latent representations are a key driver of the gains traditionally attributed to planning.

Abstract

We present Unified Latent Dynamics (ULD), a novel reinforcement learning algorithm that unifies the efficiency of model-free methods with the representational strengths of model-based approaches, without incurring planning overhead. By embedding state-action pairs into a latent space in which the true value function is approximately linear, our method supports a single set of hyperparameters across diverse domains -- from continuous control with low-dimensional and pixel inputs to high-dimensional Atari games. We prove that, under mild conditions, the fixed point of our embedding-based temporal-difference updates coincides with that of a corresponding linear model-based value expansion, and we derive explicit error bounds relating embedding fidelity to value approximation quality. In practice, ULD employs synchronized updates of encoder, value, and policy networks, auxiliary losses for short-horizon predictive dynamics, and reward-scale normalization to ensure stable learning under sparse rewards. Evaluated on 80 environments spanning Gym locomotion, DeepMind Control (proprioceptive and visual), and Atari, our approach matches or exceeds the performance of specialized model-free and general model-based baselines -- achieving cross-domain competence with minimal tuning and a fraction of the parameter footprint. These results indicate that value-aligned latent representations alone can deliver the adaptability and sample efficiency traditionally attributed to full model-based planning.

Paper Structure

This paper contains 33 sections, 6 theorems, 21 equations, 4 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1 (Solution Equivalence)

The fixed point of the model-free update (eq:td_update) and the model-based solution (eq:model_based) are identical.
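The equivalence can be checked numerically in the purely linear case. The sketch below is not the authors' code, and the two referenced equations are not reproduced in this summary, so it assumes their standard linear forms: a TD fixed point $w = (\Phi^\top(\Phi - \gamma\Phi'))^{-1}\Phi^\top r$, and a model-based solution built from a least-squares reward model $w_r$ and a least-squares latent dynamics model $W_p$, unrolled to $w = (I - \gamma W_p)^{-1} w_r$. The names `Phi`, `Phi_next`, `fixed_points`, and the random data are all hypothetical placeholders.

```python
import numpy as np


def fixed_points(n=200, d=5, gamma=0.9, seed=0):
    """Return (w_td, w_mb) for random placeholder embeddings and rewards."""
    rng = np.random.default_rng(seed)
    Phi = rng.normal(size=(n, d))       # embeddings of sampled (s, a) pairs
    Phi_next = rng.normal(size=(n, d))  # embeddings of the successor (s', a') pairs
    r = rng.normal(size=n)              # sampled rewards

    # Model-free: solve Phi^T (Phi - gamma * Phi_next) w = Phi^T r,
    # the stationarity condition of linear TD(0) on these samples.
    A = Phi.T @ (Phi - gamma * Phi_next)
    w_td = np.linalg.solve(A, Phi.T @ r)

    # Model-based: fit linear reward and latent dynamics by least squares,
    # then unroll the value expansion w = w_r + gamma * W_p @ w.
    w_r = np.linalg.lstsq(Phi, r, rcond=None)[0]         # Phi @ w_r ~ r
    W_p = np.linalg.lstsq(Phi, Phi_next, rcond=None)[0]  # Phi @ W_p ~ Phi_next
    w_mb = np.linalg.solve(np.eye(d) - gamma * W_p, w_r)

    return w_td, w_mb


w_td, w_mb = fixed_points()
print(np.max(np.abs(w_td - w_mb)))  # difference is at floating-point noise level
```

The two weight vectors agree because multiplying $(I - \gamma W_p)\,w = w_r$ on the left by $\Phi^\top\Phi$ recovers the TD system exactly. This holds only under the assumed linear setup with $\Phi^\top\Phi$ invertible.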

Figures (4)

  • Figure 1: Aggregate learning curves. Average performance over each benchmark. Results are over 10 seeds. Due to action repeat, 500k time steps in DMC correspond to 1M frames in the original environment and 2.5M time steps in Atari correspond to 10M frames in the original environment.
  • Figure 2: Aggregate metrics comparison: (Left) Gym Locomotion, (Right) DMC Proprioceptive.
  • Figure 3: Performance comparison across Gym-Locomotion tasks. Bars show final average return at 1M time steps over 10 seeds. Error bars represent 95% bootstrap confidence intervals.
  • Figure 4: Performance comparison across all 28 DMC-Proprioceptive tasks. Bars show final average return at 500k time steps over 10 seeds. Error bars represent 95% bootstrap confidence intervals.
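The step-to-frame conversion in the Figure 1 caption is a single multiplication. In the short sketch below, the action-repeat values (2 for DMC, 4 for Atari) are not stated in this summary. They are inferred from the ratios the caption gives, and `env_frames` is a hypothetical helper name.

```python
def env_frames(agent_steps: int, action_repeat: int) -> int:
    """Environment frames consumed when each agent step repeats its action."""
    return agent_steps * action_repeat


# Action repeats inferred from the caption's ratios (assumption, not stated).
dmc_frames = env_frames(500_000, action_repeat=2)
atari_frames = env_frames(2_500_000, action_repeat=4)
print(dmc_frames, atari_frames)  # 1000000 10000000
```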

Theorems & Definitions (6)

  • theorem 1: Solution Equivalence
  • theorem 2: Error Bound
  • theorem 3: Non-Linear Representation
  • theorem 4: Solution Equivalence
  • theorem 5: Error Bound
  • theorem 6: Non-Linear Representation