Video Occupancy Models
Manan Tomar, Philippe Hansen-Estruch, Philip Bachman, Alex Lamb, John Langford, Matthew E. Taylor, Sergey Levine
TL;DR
Video Occupancy Models (VOCs) are a latent-space approach to video prediction for control. Instead of predicting pixel-level futures, a VOC predicts the discounted distribution of future latent representations in a single step. Each instantiation pairs a latent tokenizer (VQ-VAE, inverse dynamics, or self-supervised distillation) with a GPT-style autoregressive model trained using a generative TD objective, which makes it possible to sample future representations and estimate values directly in the latent space. The authors evaluate VOCs across these instantiations, report improved multi-step forecasting without iterative rollouts, and integrate VOCs into Model Predictive Control, where they achieve higher returns than the baselines. The work presents a scalable, control-oriented alternative to pixel-level video prediction, with future directions including joint representation learning and longer-horizon dynamics in richer latent spaces.
Abstract
We introduce a new family of video prediction models designed to support downstream control tasks. We call these models Video Occupancy Models (VOCs). VOCs operate in a compact latent space, thus avoiding the need to make predictions about individual pixels. Unlike prior latent-space world models, VOCs directly predict the discounted distribution of future states in a single step, thus avoiding the need for multistep roll-outs. We show that both properties are beneficial when building predictive models of video for use in downstream control. Code is available at https://github.com/manantomar/video-occupancy-models.
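To make "the discounted distribution of future states" concrete, the sketch below draws samples from it empirically along a single recorded trajectory. It assumes the standard geometric weighting, in which the state k steps ahead receives probability (1 - gamma) * gamma^(k - 1); the function name `sample_discounted_future`, the integer stand-ins for latent codes, and the truncation at the end of the trajectory are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def sample_discounted_future(latents, t, gamma, rng):
    """Draw one latent from the gamma-discounted distribution over futures of step t.

    The offset k >= 1 is geometric with P(k) = (1 - gamma) * gamma**(k - 1).
    Offsets that run past the end of the trajectory are clipped to the last
    step (an assumption made here for simplicity).
    """
    k = rng.geometric(1.0 - gamma)          # support is {1, 2, 3, ...}
    idx = min(t + k, len(latents) - 1)      # clip to the final step
    return latents[idx]

rng = np.random.default_rng(0)
latents = np.arange(100)                    # hypothetical stand-in for latent codes
gamma = 0.9
samples = np.array(
    [sample_discounted_future(latents, 0, gamma, rng) for _ in range(10_000)]
)
mean_offset = samples.mean()                # expected offset is 1 / (1 - gamma)
```

With gamma = 0.9 the expected offset is 1 / (1 - gamma) = 10 steps, so most of the probability mass sits on the near future while distant states are still reachable in one draw. A model that learns this distribution can return such a sample with one forward pass rather than by chaining one-step predictions.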
