Video Models Reason Early: Exploiting Plan Commitment for Maze Solving

Kaleb Newman, Tyler Zhu, Olga Russakovsky

Abstract

Video diffusion models exhibit emergent reasoning capabilities such as solving mazes and puzzles, yet little is understood about how they reason during generation. We take a first step towards understanding this by studying the internal planning dynamics of video models, using 2D maze solving as a controlled testbed. Our investigation reveals two findings. First, video diffusion models exhibit early plan commitment: they commit to a high-level motion plan within the first few denoising steps, after which further denoising alters visual details but not the underlying trajectory. Second, path length, not obstacle density, is the dominant predictor of maze difficulty, with a sharp failure threshold at 12 steps; video models can therefore solve long mazes only by chaining together multiple sequential generations. To demonstrate the practical benefits of these findings, we introduce Chaining with Early Planning (ChEaP), which spends compute only on seeds with promising early plans and chains them together to tackle complex mazes. This improves accuracy from 7% to 67% on long-horizon mazes and by 2.5x overall on hard tasks in Frozen Lake and VR-Bench across Wan2.2-14B and HunyuanVideo-1.5. Our analysis reveals that current video models possess deeper reasoning capabilities than previously recognized, and that these can be elicited more reliably with better inference-time scaling.
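The selection idea behind ChEaP described above can be sketched minimally as follows. All names (`early_plan`, `full_generate`, the scalar plan score) are hypothetical placeholders, not the paper's actual API; the sketch only illustrates "spend full-generation compute on the seeds whose early plans score best."

```python
from typing import Callable, Sequence


def cheap_select(
    seeds: Sequence[int],
    early_plan: Callable[[int], float],   # hypothetical: score a partially denoised plan for a seed
    full_generate: Callable[[int], str],  # hypothetical: run full denoising for a seed
    k: int,
) -> list[str]:
    """Rank seeds by their early-plan score and fully generate only the top k."""
    ranked = sorted(seeds, key=early_plan, reverse=True)
    return [full_generate(s) for s in ranked[:k]]
```

In practice the chaining half of ChEaP would recondition a fresh generation on the last frame of each kept trace; that step is omitted here since it depends on model-specific conditioning interfaces.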


Paper Structure

This paper contains 38 sections, 5 equations, 41 figures, 6 tables, 1 algorithm.

Figures (41)

  • Figure 1: Video diffusion models plan early. Decoded intermediate $\hat{x}_0$ predictions reveal that the model commits to a trajectory within the first few denoising steps (green box); later steps refine visual details but rarely alter the path (blue box).
  • Figure 2: Overview of ChEaP. (Left) Early Planning Beam Search scores early plans from partially denoised predictions and selects the most promising candidates for full generation. (Right) Chaining reconditions on the last frame of successful traces to extend reasoning beyond the single-generation horizon.
  • Figure 3: Early plans stay consistent. (Left) Across multiple settings, the early trajectories emerging from decoded $\hat{x}_0$ predictions at step 5 match the final trajectory. (Right) Mean trajectory convergence throughout the denoising process. Step 5 already reaches 93% convergence, and trajectories stay converged thereafter (over 163 $4{\times}4$ mazes).
  • Figure 4: Stepwise refinement. Mean pairwise trajectory IoU among $K{=}5$ re-noised completions at each step $\tau$, across grid sizes 4--10. Even at $\tau{=}1$, branch trajectories are far more similar to each other than trajectories from different seeds (dashed line), indicating that the route is largely encoded in the initial noise sample.
  • Figure 5: EPBS finds solutions much more efficiently than best-of-$N$. Accuracy vs. number of function evaluations (NFEs) on Frozen Lake mazes across four sizes with Wan2.2-14B. EPBS consistently dominates standard best-of-$N$, with large gains on larger mazes.
  • ...and 36 more figures
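The mean pairwise trajectory IoU referenced in the Figure 4 caption can be sketched as below. The representation of a trajectory as a set of visited grid cells is an assumption for illustration; the paper's exact trajectory extraction may differ.

```python
from itertools import combinations
from typing import Iterable, Sequence, Tuple

Cell = Tuple[int, int]  # assumed (row, col) grid-cell representation


def trajectory_iou(a: Iterable[Cell], b: Iterable[Cell]) -> float:
    """IoU of two trajectories, each treated as a set of visited cells."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def mean_pairwise_iou(trajectories: Sequence[Iterable[Cell]]) -> float:
    """Average IoU over all unordered pairs of the K trajectories."""
    pairs = list(combinations(trajectories, 2))
    return sum(trajectory_iou(a, b) for a, b in pairs) / len(pairs)
```

With $K{=}5$ re-noised completions per step, this averages over the $\binom{5}{2}=10$ pairs; a value near 1 means the branches agree on the route, and comparing against pairs drawn from different seeds gives the dashed baseline in the figure.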