Bootstrapping Action-Grounded Visual Dynamics in Unified Vision-Language Models

Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

TL;DR

This work probes whether unified vision-language models can perform forward dynamics prediction (FDP) and finds that they struggle to produce physically plausible frame transitions, even though inverse dynamics prediction (IDP) is comparatively easy to learn. The authors propose two bootstrapping strategies built on an inverse dynamics model (IDM): (i) weakly supervised learning, in which the IDM annotates large-scale unlabelled videos with predicted actions to generate synthetic FDP data, and (ii) inference-time verification, in which the IDM scores multiple FDP samples and selects the best one. In experiments on Aurora-Bench with Chameleon-7B and Liquid-8B, the approach yields competitive, and in some cases state-of-the-art, performance on action-centric image editing, with GPT4o-as-judge showing 7–13% gains and human evaluators favoring the FDP-bootstrapped models. The results provide evidence that FDP can be learned in general-purpose VLMs by leveraging an IDM, offering a path toward long-horizon, language-conditioned world models, while also highlighting limitations such as copying and dataset variability and raising considerations for responsible deployment.

Abstract

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), effectively captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. Firstly, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Secondly, IDP can assign rewards to multiple samples of FDP to score them, effectively guiding search at inference time. We evaluate the FDP resulting from both strategies through the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves performance competitive with state-of-the-art image editing models, improving on them by a margin between $7\%$ and $13\%$ according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
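
The two bootstrapping strategies described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`annotate_pairs`, `best_of_k`), the callables `idm_caption`, `idm_score` and `fdm_sample`, and the toy integer "observations" are all hypothetical stand-ins for the fine-tuned VLMs, and the exact form of the IDM reward is an assumption.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Obs = TypeVar("Obs")      # an observation (an image in the paper, an int in the toy demo)
Action = TypeVar("Action")  # an action (a language instruction in the paper)


def annotate_pairs(
    frame_pairs: Iterable[Tuple[Obs, Obs]],
    idm_caption: Callable[[Obs, Obs], Action],
) -> List[Tuple[Obs, Action, Obs]]:
    """Strategy 1 (weak supervision): label unlabelled (before, after) frame
    pairs with an action predicted by a hypothetical inverse dynamics model,
    producing synthetic (observation, action, next observation) triples."""
    return [(before, idm_caption(before, after), after) for before, after in frame_pairs]


def best_of_k(
    observation: Obs,
    action: Action,
    fdm_sample: Callable[[Obs, Action], Obs],
    idm_score: Callable[[Obs, Obs, Action], float],
    k: int = 4,
) -> Tuple[Obs, float]:
    """Strategy 2 (inference-time verification): draw k candidate next
    observations and keep the one the inverse dynamics model scores highest."""
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = [fdm_sample(observation, action) for _ in range(k)]
    scores = [idm_score(observation, c, action) for c in candidates]
    best = max(range(k), key=scores.__getitem__)
    return candidates[best], scores[best]


# --- Toy stand-ins (assumptions for illustration only) ---------------------
# Observations are integers and an action is the integer difference between them.

def toy_idm_caption(before: int, after: int) -> int:
    return after - before


def toy_idm_score(before: int, after: int, action: int) -> int:
    # Higher is better: zero when the candidate matches the action exactly.
    return -abs((after - before) - action)


def make_toy_fdm(errors: List[int]) -> Callable[[int, int], int]:
    # A deterministic "noisy" forward model that adds a preset error per sample.
    it = iter(errors)
    return lambda obs, action: obs + action + next(it)


synthetic = annotate_pairs([(1, 4), (10, 8)], toy_idm_caption)
chosen, score = best_of_k(5, 2, make_toy_fdm([3, -1, 0, 2]), toy_idm_score, k=4)
```

In the toy run, the four sampled candidates differ in how far they drift from the requested action, and the verifier picks the one with zero drift. With real models, `fdm_sample` would be a stochastic image generation call and `idm_score` might be the likelihood the IDM assigns to the instruction given the two frames.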

Paper Structure

This paper contains 42 sections, 3 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: A high-level illustration of our two strategies to bootstrap Forward Dynamics Prediction from Inverse Dynamics Prediction in unified Vision--Language Models: (i) synthesising trajectories for weak supervision (left) and (ii) inference-time verification of candidate future observations (right).
  • Figure 2: Percentage of times 9 VLMs assign higher probability to observation--action--observation Reference trajectories compared with 4 types of Negative (i.e., adversarially manipulated) trajectories, for both forward dynamics and inverse dynamics prediction. Higher values are better.
  • Figure 3: Heatmap visualization of image token weights predicted by the recognition model on examples from AG, Something-Something, MagicBrush, Kubric, UCF-101, Kinetics700, and MIT.
  • Figure 4: GPT-4o scores for test-time verification with $K$ samples, where $K \in \{1, 2, 4, 8\}$. We use a blue line for C-FT and a red line for L-FT. For C-FT, we plot the standard deviation as the shaded area due to its large variance. We indicate the scores for GoT (GT) and SmartEdit (SE) as horizontal lines.
  • Figure 5: A qualitative case of real-world next-observation prediction, demonstrating C-FDM's ability to steer predictions using language and perform sequential predictions. More cases from Aurora-Bench are in the appendix.
  • ...and 7 more figures