Demystifying Video Reasoning

Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
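The training-free idea mentioned at the end of the abstract, ensembling latent trajectories from the same model under different random seeds, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `toy_denoiser` is a hypothetical stand-in for a real video diffusion model, and averaging the latents at a chosen `merge_step` is only one possible way to ensemble; the abstract does not specify the actual procedure.

```python
import numpy as np


def toy_denoiser(latent, step):
    """Hypothetical stand-in for one denoising update of a video model.

    A real model would predict and remove noise; here we simply shrink
    the latent so the example stays self-contained.
    """
    return 0.9 * latent


def ensemble_trajectories(seeds, shape, total_steps, merge_step):
    """Run one trajectory per seed, average them at merge_step, then finish.

    Averaging is an assumed ensembling rule, not the paper's stated method.
    """
    # One initial noise latent per seed (same "model", different seeds).
    latents = [np.random.default_rng(s).standard_normal(shape) for s in seeds]

    # Early steps: each trajectory evolves independently.
    for step in range(merge_step):
        latents = [toy_denoiser(z, step) for z in latents]

    # Merge the candidate trajectories into a single latent.
    merged = np.mean(latents, axis=0)

    # Late steps: continue denoising from the merged latent.
    for step in range(merge_step, total_steps):
        merged = toy_denoiser(merged, step)
    return merged


# Latent shape (frames, height, width) is arbitrary here.
result = ensemble_trajectories(seeds=[0, 1, 2], shape=(4, 8, 8),
                               total_steps=10, merge_step=3)
```

Merging early reflects the observation that candidate solutions coexist in the early denoising steps; where to merge and how to combine trajectories are design choices this sketch leaves open.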

Paper Structure (26 sections, 3 equations, 17 figures, 3 tables)

Figures (17)

  • Figure 1: Chain-of-Steps. We discover that video reasoning occurs along the diffusion steps with surprising emergent behaviors such as making multiple possible moves (e.g., paths) simultaneously at early steps, gradually pruning suboptimal choices during middle steps, and reaching a final decision at the late steps. This maze-solving example asks the model to start from the green circle in the top-left corner and find the red rectangle. Key regions of interest are color-coded and enlarged on the right.
  • Figure 2: Chain-of-Steps elicits reasoning along the diffusion process. We observe that video reasoning models explore multiple possible solutions simultaneously in the early denoising steps before converging to a final outcome in later steps. Specifically, we observe: (a) two potential routes (cyan arrows highlight the "imaginary traces") for the robot; (b) two possible placements of the "O" piece; (c) multiple candidate end positions for the plant; (d) simultaneous selection of two diamonds; (e) large and small circles overlapping with each other; and (f) all possible rotations of the L-shaped object superimposed.
  • Figure 3: Noise perturbation and information flow. (a) Illustration of noise injection schemes; "Noise at Step" suffers more significant corruption than "Noise at Frame". (b) Performance drop with the two noise injection schemes. X-axis is the injection index (either diffusion step or frame). (c) Information flow across denoising steps (CKA dissimilarity: 1.0 indicates complete corruption, 0.0 indicates no effect).
  • Figure 4: Emergent reasoning behaviors: memory and self-correction. (a) The center point is retained to guide the return motion. (b) The contour of the occluded small teddy bear is preserved, enabling the model to address object permanence. (c) The trajectory of the ball gradually extends and becomes complete. (d) The missing cube only appears in the later diffusion steps. Cyan boxes are added for illustration; they are not part of the generated video.
  • Figure 5: Emergent reasoning behavior: understanding before reasoning. (a) Early diffusion steps identify the car as the object of interest, while later steps introduce motion and simulate physical interactions. (b) Early steps recognize the door as the target object, and later steps manipulate it.
  • ...and 12 more figures
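
The CKA dissimilarity scale used in Figure 3(c), where 0.0 means a perturbation had no effect and 1.0 means complete corruption, can be reproduced with a minimal sketch. It assumes the linear variant of Centered Kernel Alignment and defines dissimilarity as one minus CKA; the figure caption does not state which variant or feature layout the authors use, so the function names and the `(n_samples, dim)` layout are illustrative choices.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between feature matrices of shape (n_samples, dim)."""
    # Center each feature dimension across samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_dissimilarity(clean, perturbed):
    """Assumed definition: 0.0 = identical representations, 1.0 = unrelated."""
    return 1.0 - linear_cka(clean, perturbed)


rng = np.random.default_rng(0)
clean = rng.standard_normal((256, 8))      # stand-in for clean activations
unrelated = rng.standard_normal((256, 8))  # stand-in for fully corrupted ones

same_score = cka_dissimilarity(clean, clean)
scaled_score = cka_dissimilarity(clean, 3.0 * clean)
unrelated_score = cka_dissimilarity(clean, unrelated)
```

Because linear CKA is invariant to isotropic scaling, `scaled_score` stays near zero just like `same_score`, while independent random features give a score close to one.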