Seeking Physics in Diffusion Noise

Chujun Tang, Lei Zhong, Fangqiang Ding

Abstract

Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving results comparable to Best-of-K sampling with substantially fewer denoising steps.

Paper Structure (27 sections, 5 equations, 8 figures, 10 tables, 1 algorithm)

Figures (8)

  • Figure 1: Progressive trajectory selection with a physics verifier. Given a text prompt, we sample $N$ denoising trajectories from different seeds and score each partially denoised sample at intermediate timesteps (e.g., $t{=}600,400$) using a lightweight physics verifier applied to frozen DiT features. At each checkpoint we keep the top fraction (e.g., $N/2$) and terminate the rest early, continuing denoising only for the survivors until a single winner is fully denoised ($t{=}0$).
  • Figure 2: Feature extraction pipeline. Given a generated video and its prompt, we encode the video with a VAE, add diffusion noise at timestep $t$, and run a frozen diffusion transformer. We extract hidden states at layer $\ell$, remove text-conditioning tokens, and spatially mean-pool video tokens to obtain per-frame features $\mathbf{f}^{(\ell)}_t \in \mathbb{R}^{F \times D}$. Each data sample is paired with PC and SA labels $y^{\text{pc}}, y^{\text{sem}} \in \{0,1\}$ and is used for linear probing (Sec. \ref{sec:probing}) and physics-verifier training (Sec. \ref{sec:physics_head}).
  • Figure 3: Source structure in DiT features. (a) UMAP of CogVideoX-2B DiT features extracted after transformer block $\ell{=}10$ at $t{=}200$, colored by the original video generator, reveals strong source clustering. (b) Cross-source probing AUC matrix (train $\times$ test source): within-source evaluation (diagonal, 5-fold CV; mean 0.637) is substantially higher than cross-source transfer (off-diagonal mean 0.538).
  • Figure 4: Qualitative comparison on PhyGenBench prompts (4 uniformly sampled frames per video). Mechanics: the baseline pours oil downward as if under terrestrial gravity; ours forms a floating liquid mass consistent with microgravity. Optics: the texture magnified through the baseline's lens appears incoherent with the surrounding leaf venation; ours produces a magnified view whose structure is consistent with the underlying leaf surface. Thermal and Physical Properties: the baseline shows a liquid stream (incorrect); ours shows rising vapor consistent with sublimation.
  • Figure 5: Verifier score analysis. Top: kept vs. dropped trajectory scores at checkpoints. Bottom: per-prompt score spread (max $-$ min) at the first checkpoint. 5B shows meaningful separation; Wan shows near-complete overlap.
  • ...and 3 more figures
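
The pruning schedule described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `progressive_select`, `toy_score`, and the `keep_fraction` parameter are hypothetical names, and the toy scorer stands in for the physics verifier applied to frozen DiT features. Taking the top-scoring survivor when more than one trajectory remains after the last checkpoint is also an assumption.

```python
from typing import Callable, List, Sequence, Tuple


def progressive_select(
    seeds: Sequence[int],
    checkpoints: Sequence[int],
    score_fn: Callable[[int, int], float],
    keep_fraction: float = 0.5,
) -> Tuple[int, List[List[int]]]:
    """Prune candidate trajectories at each checkpoint timestep.

    seeds:         identifiers of the N parallel trajectories (one per seed).
    checkpoints:   intermediate timesteps, in denoising order (e.g. 600, 400).
    score_fn:      hypothetical verifier, maps (seed, timestep) -> score.
    keep_fraction: share of trajectories kept at each checkpoint.
    Returns the winning seed and the list of survivors after each stage.
    """
    survivors = list(seeds)
    history = [list(survivors)]
    for t in checkpoints:
        if len(survivors) == 1:
            break  # nothing left to prune
        # Rank the partially denoised candidates by verifier score, best first.
        ranked = sorted(survivors, key=lambda s: score_fn(s, t), reverse=True)
        n_keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:n_keep]  # the rest are terminated early
        history.append(list(survivors))
    # Assumption: if several remain, the best-ranked one at the last
    # checkpoint is the trajectory that gets fully denoised to t = 0.
    return survivors[0], history


# Toy stand-in for the verifier: a fixed score per seed, independent of t.
TOY_QUALITY = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.7}


def toy_score(seed: int, t: int) -> float:
    return TOY_QUALITY[seed]


winner, history = progressive_select([0, 1, 2, 3], [600, 400], toy_score)
print(winner)   # seed of the trajectory that would be denoised to t = 0
print(history)  # survivors after each stage
```

With $N{=}4$ and a keep fraction of one half, the candidate set shrinks from four to two to one across the two checkpoints, so only one trajectory pays for the full denoising schedule.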