Programmatic Video Prediction Using Large Language Models

Hao Tang, Kevin Ellis, Suhas Lohit, Michael J. Jones, Moitreya Chatterjee

TL;DR

ProgGen tackles video frame prediction by learning a world model through neuro-symbolic states and a triad of LLM/VLM-synthesized programs for perception, dynamics, and rendering. It combines a perception program $\mathcal{P}$, a dynamics program $\mathcal{D}$ with global parameters $\theta$, and a rendering program $\mathcal{R}$ to encode frames into interpretable states $s_i$, forecast future states, and render them back into RGB frames, guided by Affordance Rules. Training follows a two-stage approach, first discovering the programs and then optimizing continuous parameters, with a surrogate state-level loss to improve efficiency. Empirically, ProgGen achieves strong, data-efficient performance on PhyWorld and Cart Pole, outperforming diffusion baselines in out-of-distribution settings and enabling counterfactual editing and interpretable video generation, underscoring its potential for sample-efficient, controllable video synthesis in robotics and perception tasks.
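The second training stage, fitting continuous parameters against a state-level loss instead of a pixel-level one, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the free-fall dynamics, the names `rollout`, `state_loss`, and `fit_theta`, and the use of ternary search as the optimizer are all hypothetical choices made for illustration. The point it shows is that comparing predicted and perceived states avoids rendering frames inside the optimization loop.

```python
from typing import List, Tuple

State = Tuple[float, float]  # (height y, vertical velocity vy); a hypothetical state layout


def rollout(s0: State, theta: float, steps: int) -> List[State]:
    """Hypothetical dynamics program: Euler free fall with gravity-like parameter theta."""
    y, vy = s0
    states = []
    for _ in range(steps):
        vy = vy + theta
        y = y + vy
        states.append((y, vy))
    return states


def state_loss(theta: float, s0: State, targets: List[State]) -> float:
    """Surrogate loss: squared error on positions in state space, with no rendering."""
    preds = rollout(s0, theta, len(targets))
    return sum((py - ty) ** 2 for (py, _), (ty, _) in zip(preds, targets))


def fit_theta(s0: State, targets: List[State],
              lo: float = 0.0, hi: float = 2.0, iters: int = 60) -> float:
    """Ternary search over theta; valid here because the loss is convex in theta."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if state_loss(m1, s0, targets) < state_loss(m2, s0, targets):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


# Stand-in for states that a perception program would extract from training frames.
observed = rollout((0.0, 0.0), 0.5, 5)
theta_hat = fit_theta((0.0, 0.0), observed)
```

In this toy setting the fitted `theta_hat` recovers the value used to generate `observed`; a real system would substitute the synthesized dynamics program and whatever optimizer suits its parameters.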

Abstract

Estimating a world model that describes the dynamics of a real-world process is important for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics, and autonomous driving, this objective entails synthesizing plausible visual futures given a few frames of a video that set the visual context. To this end, we propose ProgGen, which performs video frame prediction by representing the dynamics of the video with a set of neuro-symbolic, human-interpretable states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen uses an LLM/VLM to synthesize programs that: (i) estimate the states of the video given the visual context (i.e., the frames); (ii) predict the states at future time steps by estimating the transition dynamics; and (iii) render the predicted states as RGB frames. Empirical evaluations show that the proposed method outperforms competing techniques at video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counterfactual reasoning and interpretable video generation, attesting to its effectiveness and generalizability for video generation tasks.
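The three-program structure described in the abstract can be sketched on a toy one-dimensional "video". This is a hedged illustration only: in ProgGen the programs are synthesized by an LLM/VLM, whereas here `perceive`, `dynamics`, and `render` are hand-written stand-ins, and the `State` fields, the `WIDTH` constant, and the constant-velocity model are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

WIDTH = 16  # hypothetical frame width in pixels


@dataclass
class State:
    x: float   # object position in pixels
    vx: float  # object velocity in pixels per frame


def render(state: State) -> List[int]:
    """Stand-in rendering program: draw a one-pixel object on a 1-D frame."""
    frame = [0] * WIDTH
    frame[int(round(state.x)) % WIDTH] = 1
    return frame


def perceive(frames: List[List[int]]) -> List[State]:
    """Stand-in perception program: read positions, estimate velocity by differencing.

    Requires at least two context frames.
    """
    xs = [f.index(1) for f in frames]
    states = []
    for i, x in enumerate(xs):
        vx = xs[i] - xs[i - 1] if i > 0 else xs[1] - xs[0]
        states.append(State(float(x), float(vx)))
    return states


def dynamics(state: State, theta: float) -> State:
    """Stand-in dynamics program: velocity scaled by a global parameter theta."""
    vx = theta * state.vx
    return State(state.x + vx, vx)


def predict(context: List[List[int]], horizon: int, theta: float = 1.0) -> List[List[int]]:
    """Encode context frames into states, roll the state forward, render each step."""
    state = perceive(context)[-1]
    future = []
    for _ in range(horizon):
        state = dynamics(state, theta)
        future.append(render(state))
    return future


context_frames = [render(State(2.0, 1.0)), render(State(3.0, 1.0))]
future_frames = predict(context_frames, horizon=3)
positions = [f.index(1) for f in future_frames]
```

Because the intermediate state is explicit, a counterfactual edit amounts to changing a field of `State` or the value of `theta` before calling `dynamics`, which mirrors the kind of editing the abstract describes.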

Paper Structure

This paper contains 26 sections, 10 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of ProgGen showing the meta program that solves the video prediction task by: (i) generating a program for perception and estimation of object states ($\mathcal{P}$), (ii) estimating the world model (i.e. the dynamics of the video) program synthetically ($\mathcal{D}$), and (iii) rendering future frames using a rendering program ($\mathcal{R}$). The figure also illustrates the output of these steps for the Cart Pole setting, including the principal objects that have been discovered (such as "the black cart"), their states, the generated program for estimating the dynamics and the predicted frames of the video.
  • Figure 2: Plate diagram depicting the graphical model of ProgGen, during inference. The first $F+1$ frames are used to set the visual conditioning while the subsequent frames are predicted.
  • Figure 3: Qualitative visualization of frames predicted by our method for the out-of-domain two-ball collision case.
  • Figure 4: Qualitative visualization of frame prediction results between competing methods on the Cart Pole environment.
  • Figure 5: Frame prediction results for our method upon: (a) doubling the pole length; (b) switching the direction of motion of the cart.
  • ...and 1 more figures