VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu
TL;DR
VChain addresses a weakness of current video generators, their difficulty modeling causal dynamics, by injecting high-level reasoning signals from large multimodal models at inference time. It uses a three-stage pipeline (Visual Thought Reasoning, Sparse Inference-Time Tuning, and Video Sampling) in which a Chain of Visual Thoughts guides sparse, LoRA-based fine-tuning of a diffusion-based video generator. The approach improves coherence, physical plausibility, and causal reasoning in complex scenarios while remaining efficient and avoiding dense supervision. Quantitative benchmarks and qualitative analyses indicate consistent gains over baselines, and ablations corroborate these results, underscoring the value of integrating reasoning signals into video synthesis.
Abstract
Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.
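The control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: every name here (`VisualThought`, `ToyVideoGenerator`, `toy_reasoner`, `vchain_sketch`) is hypothetical, the multimodal model is replaced by a hard-coded callable, and the generator is a stand-in that only counts tuning steps instead of updating LoRA weights. The sketch also assumes that sampling reuses the original text prompt. It shows only the ordering of the three stages: produce a sparse chain of keyframes, tune on those keyframes alone, then sample.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class VisualThought:
    """One keyframe in the chain: a caption plus a placeholder image handle."""
    caption: str
    image: str  # hypothetical: stands in for real image data


# (prompt, thoughts so far) -> next thought, or None when the chain is complete.
Reasoner = Callable[[str, List[VisualThought]], Optional[VisualThought]]


@dataclass
class ToyVideoGenerator:
    """Stand-in for a pre-trained video generator with a small trainable adapter."""
    tuning_steps: int = 0
    tuned_on: List[str] = field(default_factory=list)

    def tune_on_keyframe(self, thought: VisualThought) -> None:
        # Assumption: a real system would take one gradient step on adapter
        # (e.g. LoRA) weights here; this toy only records that a step happened.
        self.tuning_steps += 1
        self.tuned_on.append(thought.caption)

    def sample(self, prompt: str, num_frames: int) -> List[str]:
        # Placeholder "video": one string per frame.
        return [f"frame {i}: {prompt}" for i in range(num_frames)]


def visual_thought_reasoning(
    prompt: str, reasoner: Reasoner, max_keyframes: int
) -> List[VisualThought]:
    """Stage 1: query the reasoner until it stops or the keyframe budget is used."""
    thoughts: List[VisualThought] = []
    while len(thoughts) < max_keyframes:
        nxt = reasoner(prompt, thoughts)
        if nxt is None:
            break
        thoughts.append(nxt)
    return thoughts


def sparse_inference_time_tuning(
    generator: ToyVideoGenerator, thoughts: List[VisualThought], steps_per_keyframe: int
) -> None:
    """Stage 2: tune only on the sparse keyframes, never on dense video."""
    for thought in thoughts:
        for _ in range(steps_per_keyframe):
            generator.tune_on_keyframe(thought)


def vchain_sketch(
    prompt: str,
    reasoner: Reasoner,
    generator: ToyVideoGenerator,
    max_keyframes: int = 4,
    steps_per_keyframe: int = 2,
    num_frames: int = 8,
) -> List[str]:
    """Run the three stages in order and return the sampled frames."""
    thoughts = visual_thought_reasoning(prompt, reasoner, max_keyframes)
    sparse_inference_time_tuning(generator, thoughts, steps_per_keyframe)
    return generator.sample(prompt, num_frames)  # Stage 3: video sampling


def toy_reasoner(prompt: str, history: List[VisualThought]) -> Optional[VisualThought]:
    """Hard-coded stand-in for a multimodal model predicting the next key state."""
    states = ["ball at rest on table edge", "ball falling", "ball bouncing on floor"]
    if len(history) >= len(states):
        return None
    index = len(history)
    return VisualThought(caption=states[index], image=f"keyframe_{index}.png")


generator = ToyVideoGenerator()
frames = vchain_sketch("a ball rolls off a table", toy_reasoner, generator)
```

In this toy setup the tuning cost grows with the number of keyframes rather than the number of video frames, which is the sense in which the supervision is sparse.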
