Discrete Codebook World Models for Continuous Control

Aidan Scannell, Mohammadreza Nakhaei, Kalle Kujanpää, Yi Zhao, Kevin Sebastian Luck, Arno Solin, Joni Pajarinen

TL;DR

Modeling discrete latent states has benefits over continuous latent states, and discrete codebook encodings are more effective representations for continuous control than alternatives such as one-hot and label-based encodings.

Abstract

In reinforcement learning (RL), world models serve as internal simulators, enabling agents to predict environment dynamics and future outcomes in order to make informed decisions. While previous approaches leveraging discrete latent spaces, such as DreamerV3, have demonstrated strong performance in discrete action settings and visual control tasks, their comparative performance in state-based continuous control remains underexplored. In contrast, methods with continuous latent spaces, such as TD-MPC2, have shown notable success in state-based continuous control benchmarks. In this paper, we demonstrate that modeling discrete latent states has benefits over continuous latent states and that discrete codebook encodings are more effective representations for continuous control, compared to alternative encodings, such as one-hot and label-based encodings. Based on these insights, we introduce DCWM: Discrete Codebook World Model, a self-supervised world model with a discrete and stochastic latent space, where latent states are codes from a codebook. We combine DCWM with decision-time planning to get our model-based RL algorithm, named DC-MPC: Discrete Codebook Model Predictive Control, which performs competitively against recent state-of-the-art algorithms, including TD-MPC2 and DreamerV3, on continuous control benchmarks. See our project website www.aidanscannell.com/dcmpc.

Paper Structure

This paper contains 51 sections, 9 equations, 21 figures, 4 tables, 2 algorithms.

Figures (21)

  • Figure 1: World model training. DCWM is a world model with a discrete latent space where each latent state is a discrete code ${\bm{c}}$ from a codebook $\mathcal{C}$. Observations ${\bm{o}}$ are first mapped through the encoder and then quantized into one of the discrete codes. We model probabilistic latent transition dynamics $p_{\phi}({\bm{c}}' \,|\, {\bm{c}}, {\bm{a}})$ as a classifier such that it captures a potentially multimodal distribution over the next state ${\bm{c}}'$ given the previous state ${\bm{c}}$ and action ${\bm{a}}$. During training, multi-step predictions are made using straight-through (ST) Gumbel-softmax sampling such that gradients backpropagate through time to the encoder. Given this discrete formulation, we train the latent space using a classification objective, i.e. cross-entropy loss. Making the latent representation stochastic and discrete with a codebook contributes to the very high sample efficiency of DC-MPC.
  • Figure 2: Illustration of Codebook ($\mathcal{C}$). FSQ's codebook is a $b$-dimensional hypercube (left). This figure illustrates a $b=3$-dimensional codebook, where each axis of the $3$-dimensional hypercube (left) corresponds to one dimension of the codebook (right). The $i^{\text{th}}$ dimension of the hypercube is discretized into $L_{i}$ values, e.g., the $x$- and $y$-axes are discretized into $L_{0}=L_{1}=5$ values and the $z$-axis into $L_{2}=4$. Code symbols (here integers) are normalized to the range $[-1,1]$.
  • Figure 3: Latent space ablation. Evaluation of (i) discrete (Discrete) vs continuous (Continuous) latent spaces, (ii) using cross-entropy (CE) vs mean squared error (MSE) for the latent-state consistency loss, and (iii) formulating a deterministic (det) vs stochastic (stoch) dynamics model. Discretizing the latent space (red) improves sample efficiency over the continuous latent space (orange), and formulating stochastic dynamics and training with cross-entropy (purple) improves performance further.
  • Figure 4: Discrete encodings ablation. DC-MPC with its discrete codebook encoding (purple) outperforms DC-MPC with one-hot encoding (red) and label encoding (blue), in terms of both sample efficiency (left) and computational efficiency (right). The dynamics model $p_{\phi}(\mathbf{c}' \,|\, \mathbf{c}, \mathbf{a})$ used codes, whilst the reward $R_{\xi}(\mathbf{e}, \mathbf{a})$, critic $Q_{\psi}(\mathbf{e}, \mathbf{a})$ and prior policy $\pi_{\eta}(\mathbf{e})$ used the respective encoding $\mathbf{e}$.
  • Figure 5: Aggregate training curves in DMControl, Meta-World, & MyoSuite. DC-MPC generally matches TD-MPC2 whilst outperforming DreamerV3, SAC and TD-MPC across all tasks. We plot the mean (solid line) and the $95\%$ confidence intervals (shaded) across 3 seeds per task.
  • ...and 16 more figures
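
The codebook layout described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each encoder output dimension is squashed with `tanh` and rounded uniformly to one of $L_i$ integer symbols, which are then normalized to $[-1,1]$. The names `quantize`, `normalize`, and `code_index` are hypothetical, and the level counts `(5, 5, 4)` are taken from the caption's example.

```python
import math
from itertools import product

# Levels per dimension, following the Figure 2 example (x: 5, y: 5, z: 4).
LEVELS = (5, 5, 4)


def quantize(z, levels=LEVELS):
    """Map a continuous vector z to one integer symbol per dimension.

    Assumption: tanh squashing followed by uniform rounding; the real
    FSQ bounding function may differ in detail.
    """
    symbols = []
    for z_i, num_levels in zip(z, levels):
        bounded = (math.tanh(z_i) + 1.0) / 2.0       # squash into [0, 1]
        symbols.append(round(bounded * (num_levels - 1)))
    return tuple(symbols)


def normalize(symbols, levels=LEVELS):
    """Rescale integer symbols {0, ..., L_i - 1} to the range [-1, 1]."""
    return tuple(
        2.0 * k / (num_levels - 1) - 1.0
        for k, num_levels in zip(symbols, levels)
    )


def code_index(symbols, levels=LEVELS):
    """Mixed-radix index of a code, usable as a class label."""
    index = 0
    for k, num_levels in zip(symbols, levels):
        index = index * num_levels + k
    return index


# Enumerate every code: the grid points of the 3-dimensional hypercube.
codebook = [normalize(s) for s in product(*(range(n) for n in LEVELS))]

symbols = quantize((10.0, -10.0, 0.3))
code = normalize(symbols)
label = code_index(symbols)
```

With these levels the codebook holds $5 \times 5 \times 4 = 100$ codes. The integer `label` is the kind of target a classifier-style dynamics model with a cross-entropy loss could predict, while the normalized `code` keeps the ordinal structure of each dimension that one-hot and label encodings discard.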