Planning with Sketch-Guided Verification for Physics-Aware Video Generation

Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal

TL;DR

This work tackles the challenge of generating temporally coherent and physically plausible videos from prompts by decoupling motion planning from costly generation. It introduces SketchVerify, a training-free, test-time planning framework that samples multiple candidate trajectories, renders lightweight video sketches, and uses a multimodal verifier to select semantically valid and physically plausible plans before final diffusion-based synthesis. By verifying motion on sketch representations, it avoids expensive iterative full-video generation while achieving state-of-the-art performance on instruction following and physical coherence on WorldModelBench and PhyWorldBench. The approach demonstrates substantial efficiency gains, robust generalization, and clear ablation evidence that multimodal verification and sketch-based evaluation improve trajectory quality and realism.

Abstract

Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ either single-shot plans, which are typically limited to simple motions, or iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
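
The sample-render-verify loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`render_sketch`, `sample_trajectory`, `toy_verifier`, `plan`) is hypothetical, the random-walk sampler stands in for the MLLM planner, and the hand-written scoring rule stands in for the vision-language verifier. The "move right" instruction, the 50/50 score weighting, and the stopping threshold are likewise assumptions chosen only to make the example runnable.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[int, int]   # (row, col) of the object's top-left corner
Frame = List[List[int]]   # grayscale image stored as nested lists


def render_sketch(background: Frame, sprite: Frame, trajectory: List[Point]) -> List[Frame]:
    """Composite the object sprite over a static background, one frame per waypoint."""
    height, width = len(background), len(background[0])
    frames = []
    for top, left in trajectory:
        frame = [row[:] for row in background]  # copy so the background stays untouched
        for dr, sprite_row in enumerate(sprite):
            for dc, value in enumerate(sprite_row):
                r, c = top + dr, left + dc
                if 0 <= r < height and 0 <= c < width:  # clip pixels that fall off-frame
                    frame[r][c] = value
        frames.append(frame)
    return frames


def sample_trajectory(start: Point, steps: int, rng: random.Random) -> List[Point]:
    """Hypothetical planner stand-in: a random walk biased toward the right."""
    trajectory = [start]
    for _ in range(steps):
        r, c = trajectory[-1]
        trajectory.append((r + rng.randint(-1, 1), c + rng.randint(-1, 3)))
    return trajectory


def toy_verifier(frames: List[Frame], trajectory: List[Point]) -> float:
    """Toy stand-in for the verifier: averages a semantic and a physics score in [0, 1]."""
    width = len(frames[0][0])
    # Semantic score (assumed instruction: "move right"): fraction of the width traversed.
    semantic = max(0.0, min(1.0, (trajectory[-1][1] - trajectory[0][1]) / width))
    # Physics score: share of steps whose displacement is small enough to look smooth.
    jumps = [abs(b[0] - a[0]) + abs(b[1] - a[1]) for a, b in zip(trajectory, trajectory[1:])]
    physics = sum(1 for j in jumps if j <= 2) / len(jumps)
    return 0.5 * semantic + 0.5 * physics


def plan(background: Frame, sprite: Frame, start: Point,
         verifier: Callable[[List[Frame], List[Point]], float],
         num_candidates: int = 8, threshold: float = 0.8,
         max_rounds: int = 3, seed: int = 0) -> Tuple[List[Point], float]:
    """Sample candidates, score their sketches, and keep the best plan found."""
    rng = random.Random(seed)
    best_trajectory: List[Point] = []
    best_score = float("-inf")
    for _ in range(max_rounds):
        for _ in range(num_candidates):
            candidate = sample_trajectory(start, steps=6, rng=rng)
            score = verifier(render_sketch(background, sprite, candidate), candidate)
            if score > best_score:
                best_trajectory, best_score = candidate, score
        if best_score >= threshold:  # stop once a satisfactory plan is identified
            break
    return best_trajectory, best_score


if __name__ == "__main__":
    background = [[0] * 8 for _ in range(4)]
    trajectory, score = plan(background, [[1]], (1, 0), toy_verifier)
    print("selected trajectory:", trajectory)
    print("verifier score:", round(score, 3))
```

In this sketch, only the selected trajectory would be handed to a trajectory-conditioned generator, so the expensive synthesis step runs once regardless of how many candidates are scored. The efficiency argument rests on each sketch being far cheaper to render and score than a full diffusion pass.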

Paper Structure

This paper contains 25 sections, 3 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Comparison of SketchVerify with other MLLM-planning-based video generation pipelines. Existing methods rely either on one-shot planning, which lacks correction, or on iterative refinement, which requires repeated generation. Our method addresses both issues by selecting high-quality control plans using a multimodal verifier prior to synthesis.
  • Figure 2: Overview of our framework. Given a prompt and initial frame, we (1) decompose instructions and segment movable objects, (2) sample and verify candidate trajectories using lightweight video sketches scored by a multimodal verifier, and (3) synthesize the final video using a trajectory-conditioned diffusion model. We provide more detail about the MLLM verifier in Figure 3.
  • Figure 3: Illustration of MLLM verifier. Given a video sketch and sub-instruction, the MLLM outputs semantic and physics scores used to rank candidate trajectories.
  • Figure 4: Qualitative comparison on four representative domains from WorldModelBench: Human, Natural, Video Game, and Robotics. Each group shows sampled frames from competing models given the same text prompt. Frames are uniformly sampled from each generated 81-frame video.
  • Figure 5: Ablation study on verifier modality. Introducing visual input to the verifier significantly improves both instruction following and physical plausibility, highlighting the importance of multimodal grounding for reliable trajectory evaluation.
  • ...and 8 more figures