VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu

TL;DR

VChain addresses the difficulty current video generators have in modeling dynamic causality by injecting high-level reasoning signals from multimodal models at inference time. It introduces a three-stage pipeline (Visual Thought Reasoning, Sparse Inference-Time Tuning, and Video Sampling) in which a Chain of Visual Thoughts guides sparse, LoRA-based fine-tuning of a diffusion-based video generator. The approach improves coherence, physical plausibility, and causal reasoning in complex scenarios while remaining efficient and avoiding dense supervision. Quantitative benchmarks and qualitative analyses indicate consistent gains over baselines, and ablations show that removing either component degrades video quality, underscoring the value of integrating reasoning signals into video synthesis.

Abstract

Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.

Paper Structure

This paper contains 25 sections, 4 equations, 19 figures, 2 tables, 1 algorithm.

Figures (19)

  • Figure 1: Overview of VChain. We introduce VChain, an inference-time tuning framework for reasoning in video generation. Given a user-provided prompt (e.g., “A rock and a feather are falling from the sky towards the ground.”), VChain leverages large multimodal models to generate a Chain of Visual Thoughts, which are a sparse set of causally important keyframes to guide the video generator via Sparse Inference-Time Tuning. VChain effectively improves reasoning in video generation without extensive re-training.
  • Figure 2: VChain Framework. An overview of our three-stage inference-time pipeline for reasoning in video generation. (a) Visual Thought Reasoning: Given a user-provided text prompt, a large multimodal model (GPT-4o) infers a causal chain of events and generates a sequence of keyframes, termed the Chain of Visual Thoughts, via iterative reasoning and image synthesis. (b) Sparse Inference-Time Tuning: These visual thoughts (paired with their corresponding textual thoughts) serve as sparse supervision for fine-tuning a pre-trained video generator via LoRA. (c) Video Sampling: The full sequence of textual thoughts is concatenated to form a single prompt, which is used to prompt the fine-tuned model in generating the final video output.
  • Figure 3: Qualitative Results - Baseline Comparison. T2V fails to capture the key causal interaction: the pins remain mostly static or jitter slightly, with no meaningful collision, revealing a lack of physical reasoning despite temporal coherence. T2V + Prompt Aug introduces relevant elements and motion, but the dynamics are erratic and implausible. Pins deform unnaturally, visual artifacts appear, and later frames become unstable, indicating poor spatial consistency. In contrast, VChain (Ours) produces a coherent and physically realistic sequence: the ball strikes the pins with plausible force, leading to consistent knockdown. Object geometry and material properties are well preserved across frames. These results show that VChain not only enables causal reasoning about the outcome of physical interactions, but also stabilizes spatial transitions.
  • Figure 4: Qualitative Results - Ablation Study. We compare VChain with two ablated variants. (1) Without Visual Thought: Although the model recognizes that the video should be in a first-person perspective based on the textual prompt, it fails to capture the correct visual pattern for a ball-catching viewpoint. In contrast, VChain leverages the reasoned Visual Thoughts to render step-by-step intermediate visual states of the throw-and-catch process. (2) Without Sparse Tuning: While Visual Thoughts are included, the model performs direct frame interpolation without tuning, leading to warping artifacts due to spatial misalignments among individual frames in Visual Thoughts. VChain (Ours) produces the most coherent and physically grounded interaction, correctly depicting the ball being thrown and caught from a first-person perspective. Removing either component degrades video synthesis quality.
  • Figure 5: Example of Visual Thoughts. We show the reasoned Visual Thoughts of the input prompt: "Concentrated sulfuric acid is poured onto a wooden table". The sequence illustrates our pipeline’s inferred causal progression across keyframes.
  • ...and 14 more figures
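
The three stages described in the Figure 2 caption can be read as plain control flow. The following is a minimal Python sketch of that flow, not the authors' implementation. Every name in it (VisualThought, reason_visual_thoughts, build_tuning_pairs, build_sampling_prompt, run_pipeline) is hypothetical, and the multimodal model, the LoRA tuning step, and the video sampler are passed in as callables and replaced here by stubs that return placeholder values.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class VisualThought:
    """One key moment: a textual thought paired with its keyframe (hypothetical type)."""
    text: str
    image: Any  # placeholder for an image object


def reason_visual_thoughts(
    prompt: str,
    propose_next: Callable[[str, List[VisualThought]], Optional[str]],
    synthesize: Callable[[str, List[VisualThought]], Any],
    max_steps: int = 4,  # assumed cap, not taken from the paper
) -> List[VisualThought]:
    """Stage (a): iteratively reason about the next state and synthesize its keyframe."""
    chain: List[VisualThought] = []
    for _ in range(max_steps):
        text = propose_next(prompt, chain)
        if text is None:  # the reasoner signals that the causal chain is complete
            break
        chain.append(VisualThought(text, synthesize(text, chain)))
    return chain


def build_tuning_pairs(chain: List[VisualThought]) -> List[Tuple[Any, str]]:
    """Stage (b) input: each (keyframe, textual thought) pair is one sparse supervision sample."""
    return [(t.image, t.text) for t in chain]


def build_sampling_prompt(chain: List[VisualThought]) -> str:
    """Stage (c) input: concatenate all textual thoughts into a single prompt."""
    return " ".join(t.text for t in chain)


def run_pipeline(prompt, propose_next, synthesize, tune, sample):
    """Chain the three stages; `tune` and `sample` stand in for LoRA tuning and video sampling."""
    chain = reason_visual_thoughts(prompt, propose_next, synthesize)
    tuned_model = tune(build_tuning_pairs(chain))
    return sample(tuned_model, build_sampling_prompt(chain))


# --- Stubs for illustration only; a real system would call a multimodal model,
# --- a LoRA trainer, and a video generator here.
_STEPS = ["Thought 1.", "Thought 2.", "Thought 3."]


def stub_propose_next(prompt, chain):
    return _STEPS[len(chain)] if len(chain) < len(_STEPS) else None


def stub_synthesize(text, chain):
    return f"<keyframe {len(chain)}>"


def stub_tune(pairs):
    return {"num_supervision_pairs": len(pairs)}


def stub_sample(model, prompt):
    return {"model": model, "prompt": prompt}


result = run_pipeline(
    "A rock and a feather are falling from the sky towards the ground.",
    stub_propose_next, stub_synthesize, stub_tune, stub_sample,
)
```

Passing the model-dependent steps in as callables keeps the sketch runnable without any external service, and makes explicit that the supervision is sparse: the tuning step only ever sees one image-text pair per visual thought, never dense video frames.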