Can World Models Benefit VLMs for World Dynamics?

Kevin Zhang, Kuangzhi Ge, Xiaowei Chi, Renrui Zhang, Shaojun Shi, Zhen Dong, Sirui Han, Shanghang Zhang

TL;DR

This work investigates whether generative world-model priors trained on video data can serve as encoders for Vision–Language Models. By repurposing Stable Video Diffusion as a Generative Encoder and fusing its dynamics latents with a SigLIP semantic encoder, the authors form WorldLMs, with the best variant named Dynamic Vision Aligner (DyVA). DyVA achieves strong zero-shot multi-frame reasoning, surpassing several baselines on MindCube and other spatial benchmarks; the authors attribute these gains to motion-consistency priors learned from video data. However, semantic grounding on language-heavy tasks remains a challenge, suggesting the need for improved language-aligned objectives or joint training strategies to fully realize the potential of dynamics-rich encodings. Overall, the paper lays out a design space and provides actionable insights toward a new family of world-model–informed multimodal learners that integrate temporal priors with language-based reasoning.

Abstract

Trained on internet-scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question: with the advent of strong video foundation models, might they supplant conventional vision encoder paradigms for general-purpose multimodal understanding? While recent studies have begun to explore the potential of world models on common vision tasks, these explorations typically lack a systematic investigation of generic, multimodal tasks. In this work, we investigate the capabilities that emerge when world-model priors are transferred into Vision-Language Models: we re-purpose a video diffusion model as a generative encoder that performs a single denoising step, and we treat the resulting latents as a set of visual embeddings. We empirically investigate this class of models, which we refer to as World-Language Models (WorldLMs), and we find that generative encoders can capture latents that are useful for downstream understanding and that differ from those of conventional encoders. Naming our best-performing variant Dynamic Vision Aligner (DyVA), we further discover that this method significantly enhances spatial reasoning abilities and enables single-image models to perform multi-frame reasoning. Through the curation of a suite of visual reasoning tasks, we find that DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance. We attribute these gains to the motion consistency that WorldLMs internalize from video pre-training. Finally, we systematically explore extensive model designs to highlight promising directions for future work. We hope our study can pave the way for a new family of VLMs that leverage priors from world models and are on a promising path towards generalist vision learners.

Paper Structure

This paper contains 24 sections, 6 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: What will happen? From reasoning to dynamic intuition — comparing how VLM and WorldLM understand and predict real-world events.
  • Figure 2: Our analysis is structured around three dimensions: (i) Paradigm comparison between static and generative encoders (e.g., SigLIP vs. SVD); (ii) Benchmark diagnosis, revealing world-model latents' strengths (e.g., spatial/multi-frame reasoning) and weaknesses (e.g., language-heavy tasks); and (iii) Design-space exploration, probing different auxiliary encoders, resolutions, and training recipes to understand how world-model features aid visual understanding.
  • Figure 3: WorldLM Pipeline. A SigLIP encoder extracts semantic features from the input image. Concurrently, a Generative Encoder generates dynamic state tokens to capture temporal changes, using evenly spaced keyframe slots. All visual tokens are projected into a shared embedding space, concatenated with text tokens, and then fed into the LLM decoder.
  • Figure 4: Paradigm Comparison. We evaluate predicting 1, 4, 8, and 14 frames with a straightforward WorldLM setup. The radar chart (left) demonstrates that predicting more frames boosts performance across various tasks, especially in visual reasoning. The qualitative example (right) illustrates that our WorldLM exhibits a distinct reasoning paradigm by envisioning, offering concise descriptions, stronger spatial grounding, and more structured temporal foresight compared to LLaVA.
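
The token flow described in the Figure 3 caption can be illustrated with a toy sketch. This is not the authors' implementation. Random vectors stand in for SigLIP features and for the generative encoder's latents, random matrices stand in for the learned projectors, and all names, dimensions, and token counts (evenly_spaced_indices, make_projection, build_llm_input, SEM_DIM, DYN_DIM, LLM_DIM) are hypothetical. The ordering of semantic tokens before dynamic tokens is also an assumption; the caption only states that visual tokens are projected into a shared space and concatenated with text tokens.

```python
import random


def evenly_spaced_indices(num_frames, num_slots):
    """Return num_slots frame indices spread evenly across num_frames frames."""
    if num_frames < 1 or num_slots < 1:
        raise ValueError("num_frames and num_slots must be positive")
    num_slots = min(num_slots, num_frames)  # cannot pick more slots than frames
    if num_slots == 1:
        return [0]
    step = (num_frames - 1) / (num_slots - 1)
    return [round(i * step) for i in range(num_slots)]


def make_projection(in_dim, out_dim, rng):
    """Random stand-in for a learned linear projector (in_dim x out_dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(tokens, weight):
    """Map each token vector into the shared embedding space."""
    out_dim = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(token)) for j in range(out_dim)]
        for token in tokens
    ]


def build_llm_input(semantic_tokens, dynamic_tokens, text_tokens, w_sem, w_dyn):
    """Project both visual streams, then concatenate them with the text tokens."""
    # Assumed ordering: semantic tokens, then dynamic tokens, then text tokens.
    visual = project(semantic_tokens, w_sem) + project(dynamic_tokens, w_dyn)
    return visual + text_tokens


rng = random.Random(0)
SEM_DIM, DYN_DIM, LLM_DIM = 6, 5, 4  # toy sizes, not the paper's dimensions

# Pick 4 keyframe slots out of 14 predicted frames (toy values).
keyframes = evenly_spaced_indices(num_frames=14, num_slots=4)

# Placeholder features: 3 semantic tokens, 2 dynamic tokens per keyframe, 2 text tokens.
semantic_tokens = [[rng.random() for _ in range(SEM_DIM)] for _ in range(3)]
dynamic_tokens = [
    [rng.random() for _ in range(DYN_DIM)] for _frame in keyframes for _tok in range(2)
]
text_tokens = [[rng.random() for _ in range(LLM_DIM)] for _ in range(2)]

w_sem = make_projection(SEM_DIM, LLM_DIM, rng)
w_dyn = make_projection(DYN_DIM, LLM_DIM, rng)

sequence = build_llm_input(semantic_tokens, dynamic_tokens, text_tokens, w_sem, w_dyn)
print(keyframes, len(sequence))  # [0, 4, 9, 13] 13
```

The two streams use separate projectors because their feature widths differ; once projected, every token has the same width and can be placed in one sequence for the decoder.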