Video Occupancy Models

Manan Tomar, Philippe Hansen-Estruch, Philip Bachman, Alex Lamb, John Langford, Matthew E. Taylor, Sergey Levine

TL;DR

Video Occupancy Models (VOCs) propose a latent-space approach to video prediction for control, predicting the discounted distribution of future latent representations in a single step rather than pixel-level futures. By combining three latent-tokenization methods (VQ-VAE, inverse dynamics, and self-supervised distillation) with a GPT-based autoregressor and a generative TD objective, VOCs enable sampling of future representations and value estimation in the latent space. The authors demonstrate VOCs across multiple instantiations, show improved multi-step forecasting without iterative rollouts, and integrate VOCs into Model Predictive Control, where they achieve higher returns than baselines. The work highlights a scalable, control-oriented alternative to pixel-level video prediction, with future directions toward joint representation learning and longer-horizon dynamics using richer latent spaces.

Abstract

We introduce a new family of video prediction models designed to support downstream control tasks. We call these models Video Occupancy Models (VOCs). VOCs operate in a compact latent space, thus avoiding the need to make predictions about individual pixels. Unlike prior latent-space world models, VOCs directly predict the discounted distribution of future states in a single step, thus avoiding the need for multistep roll-outs. We show that both properties are beneficial when building predictive models of video for use in downstream control. Code is available at https://github.com/manantomar/video-occupancy-models.
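
For reference, the "discounted distribution of future states" can be written as a geometric mixture over prediction horizons. The form below is the standard discounted-occupancy definition and is an assumption here, since the abstract does not spell it out. The symbols $\mu$, $p$, $z_e$, and $\Delta$ are introduced only for illustration, and $p$ denotes the latent dynamics under the data-collecting policy.

```latex
% Discounted occupancy as a geometric mixture over horizons \Delta >= 1
\mu(z_e \mid z_t) \;=\; (1-\gamma) \sum_{\Delta=1}^{\infty} \gamma^{\Delta-1}\, p(z_{t+\Delta} = z_e \mid z_t)

% Equivalent one-step recursion (peel off the \Delta = 1 term)
\mu(z_e \mid z_t) \;=\; (1-\gamma)\, p(z_{t+1} = z_e \mid z_t)
  \;+\; \gamma\, \mathbb{E}_{z_{t+1} \sim p(\cdot \mid z_t)}\!\left[ \mu(z_e \mid z_{t+1}) \right]
```

Setting $\gamma = 0$ leaves only the first term, which is an ordinary 1-step model. The recursion has the same shape as the temporal target described in the Figure 1 caption: the next representation with probability $(1-\gamma)$, a bootstrap sample with probability $\gamma$.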

Paper Structure

This paper contains 15 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Video Occupancy Models. Left. A small stack of pixel observations is encoded into a representation ${\mathbf{z}}_t$, which is then quantized to produce discrete tokens for a GPT model (denoted by $M(\cdot | {\mathbf{z}}_{t})$). A temporal target is encoded in the same way: it is the next representation ${\mathbf{z}}_{t+1}$ with probability $(1 - \gamma)$, or a bootstrap sample from the model conditioned on ${\mathbf{z}}_{t+1}$ with probability $\gamma$. The GPT model then performs next-token prediction on the concatenated tokens corresponding to the current representation (red tokens) and the temporal target (blue tokens). Right. The representation ${\mathbf{z}}_t$ is learnt in three different ways: 1) quantized autoencoding such as VQ-VAE, 2) inverse dynamics modelling in the presence of action data, and 3) a self-supervised distillation-based objective that gives up pixel-level reconstruction in favor of learning to predict in a latent space.
  • Figure 2: Gamma Variations. Visualization of predictions made by Video Occupancy Models (VOCs) with a VQ-VAE representation space, for different $\gamma$ values. The predictions are made in the latent space and then decoded via the VQ-VAE decoder. The bottom row shows the ground truth trajectory, with the observation highlighted in magenta denoting the conditioning state for the VOC. Each row for a given $\gamma$ value includes five independent samples produced by the VOC. All predictions are based on a single forward pass through the model. As the $\gamma$ value increases, the model produces longer term predictions within a single step. For $\gamma=0$, we recover a standard 1-step model, with all predictions being identical to the ground truth next state. For high $\gamma$ values (e.g. $\gamma = 0.9$), the predictions are less similar to the ground truth observations, since the model is asked to produce long term predictions.
  • Figure 3: Model Rollouts for Video Occupancy Models with $\gamma=0.8$ (top-left) and a 1-step model, i.e. $\gamma=0.0$ (top-right). Both models are learnt over the same VQ-VAE representation space and are conditioned on the first observation (shown in magenta) in the ground truth trajectory (bottom row). $t$ refers to the timestep in the ground truth trajectory, while $t_{model}$ refers to the number of forward passes made by the model. For $\gamma=0.8$, a single sample from the model ($t_{model} = 1$) yields a farther-in-time ($t \ge 1$) prediction. For instance, the frame marked in yellow shows how the VOC model can predict the $t=6$ observation within $t_{model}=3$ steps. In contrast, a 1-step model must be unrolled autoregressively for multiple timesteps to obtain a future prediction.
  • Figure 4: Return Distribution Estimation with VOCs. We train a reward model on the VQ representation space pre-discretization (reward loss is shown in the top row) and then use the learned reward model to plot the return distribution of a state by sampling from a Video Occupancy Model, and compare it with a one-step model (Eq. \ref{eq:value-estimation-sample}).
  • Figure 5: Inverse Dynamics Modelling VOCs with varying codebook sizes. We compare VQ-VAE and quantized MUSIK representations for different codebook sizes. Results are for the cheetah domain. Value estimation follows Eq. \ref{eq:value-estimation-sample}. Standard codebook size is 1024.
  • ...and 2 more figures
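
The target-sampling rule in the Figure 1 caption and the sample-based value estimate referenced in Figures 4 and 5 can be illustrated with a minimal sketch. This is not the authors' implementation. The names `sample_td_target`, `estimate_value`, and `make_toy_occupancy` are hypothetical, and the GPT model over tokens is replaced by a toy chain whose "representations" are integers that advance by one per step. The value estimate assumes the standard identity that the value equals the mean reward under the discounted occupancy divided by $(1 - \gamma)$; the paper's own value-estimation equation is not reproduced in this listing.

```python
import random


def sample_td_target(z_next, model_sample, gamma, rng):
    """Draw one temporal target following the Figure 1 caption.

    With probability (1 - gamma) the target is the observed next
    representation; with probability gamma it is a bootstrap sample
    from the model conditioned on that next representation.
    """
    if rng.random() < gamma:
        return model_sample(z_next, rng)
    return z_next


def estimate_value(z, model_sample, reward_fn, gamma, n_samples, rng):
    """Monte Carlo value estimate from single-step occupancy samples.

    Assumed identity (not taken from this listing):
    V(z) ~= mean reward over occupancy samples / (1 - gamma).
    """
    total = sum(reward_fn(model_sample(z, rng)) for _ in range(n_samples))
    return total / n_samples / (1.0 - gamma)


def make_toy_occupancy(gamma):
    """Hypothetical stand-in for a trained model M(. | z).

    The toy dynamics are z -> z + 1, so the exact discounted occupancy
    jumps ahead by a geometrically distributed number of steps (>= 1).
    """
    def model_sample(z, rng):
        steps = 1
        while rng.random() < gamma:
            steps += 1
        return z + steps
    return model_sample


if __name__ == "__main__":
    rng = random.Random(0)
    gamma = 0.8
    model = make_toy_occupancy(gamma)

    # One TD target for the transition 3 -> 4.
    print("TD target:", sample_td_target(4, model, gamma, rng))

    # Reward of 1 only at z == 5; value estimated from state 0.
    reward = lambda z: 1.0 if z == 5 else 0.0
    print("value estimate:", estimate_value(0, model, reward, gamma, 20000, rng))
```

In this toy chain the exact value of state 0 is $\gamma^{4} \approx 0.41$ under the convention that rewards are counted from the next step onward, so the estimate should land close to that. With $\gamma = 0$ the target is always the observed next representation, which matches the 1-step model described in the Figure 2 caption.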

Theorems & Definitions (1)

  • Remark 4.1