Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian

TL;DR

Geometry Forcing (GF) addresses the gap between 2D video diffusion and 3D structure by aligning diffusion model representations with geometry-aware features from a pretrained 3D foundation model (VGGT) using Angular and Scale Alignment losses. This yields 3D-aware latent representations, enabling more consistent long-term video generation and the potential for explicit 3D reconstruction during inference. Empirical results on RealEstate10K and Minecraft show lower FVD and better RPE/RVE than the baselines, indicating improved perceptual quality and 3D consistency. Ablations highlight the effectiveness of VGGT supervision, the two alignment losses, and mid-level feature alignment. The work suggests a practical path to integrating 3D priors into video synthesis for more robust, memory-enabled world modeling, while noting scalability considerations for larger datasets and architectures.

Abstract

Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
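
The two alignment objectives described in the abstract can be illustrated with a minimal sketch. It uses plain Python lists in place of tensors and assumes per-token feature pairs of matching dimension. The function names, the linear prediction head `weight`, and the unweighted averaging are illustrative assumptions, not the authors' implementation; the projection heads, loss weights, and layer choice used in the paper are not specified in this summary.

```python
import math

def l2_normalize(v, eps=1e-8):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def angular_alignment_loss(diff_feats, geo_feats):
    """Mean (1 - cosine similarity) over paired token features.

    diff_feats: diffusion-model features, one vector per token.
    geo_feats:  geometric foundation-model features, same shape.
    """
    total = 0.0
    for h, y in zip(diff_feats, geo_feats):
        h_hat, y_hat = l2_normalize(h), l2_normalize(y)
        total += 1.0 - sum(a * b for a, b in zip(h_hat, y_hat))
    return total / len(diff_feats)

def scale_alignment_loss(diff_feats, geo_feats, weight):
    """Mean squared error between the unnormalized geometric features and a
    prediction made from the normalized diffusion features.

    weight: hypothetical linear head (assumption), given as a list of rows
            with shape out_dim x in_dim.
    """
    total, count = 0.0, 0
    for h, y in zip(diff_feats, geo_feats):
        h_hat = l2_normalize(h)
        pred = [sum(w * x for w, x in zip(row, h_hat)) for row in weight]
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
        count += len(y)
    return total / count

# Toy features: directions match exactly, magnitudes differ.
diff = [[1.0, 0.0], [0.0, 2.0]]
geo = [[3.0, 0.0], [0.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

angular = angular_alignment_loss(diff, geo)
scale = scale_alignment_loss(diff, geo, identity)
```

In this toy case the angular term is close to zero because the directions agree, while the scale term stays large because normalization discards magnitude. This is why the abstract calls the two objectives complementary: the scale term asks a learned head to recover the magnitude information that the cosine term ignores.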

Paper Structure

This paper contains 46 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Geometry Forcing equips video diffusion models with 3D awareness. (a) We propose Geometry Forcing (GF), a simple yet effective paradigm that internalizes geometry-aware structure into video diffusion models by aligning their features with those of a pretrained geometric foundation model, i.e., VGGT wang2025vggt. (b) Compared to the baseline method dfot, our method produces generations that are more consistent both temporally and geometrically. (c) Features learned by the baseline model fail to reconstruct meaningful 3D geometry, whereas our method internalizes a 3D representation, enabling accurate 3D reconstruction from the intermediate features.
  • Figure 2: Qualitative comparison of camera view-conditioned video generation under full-circle rotation. Videos are generated from a single input frame and corresponding per-frame camera poses simulating a full 360° rotation. Our method (GF) is compared with DFoT dfot, VideoREPA zhang2025videorepa, and REPA zhang2025videorepa. The results demonstrate that the baseline methods fail to maintain temporal consistency, while our proposed GF consistently revisits the starting viewpoint.
  • Figure 3: Ablation study on alignment depth. We present FVD-256 and FVD-16 results for aligning VGGT to different layers of the diffusion model. The results suggest that mid-level feature alignment is most effective for improving long-term video quality.
  • Figure 4: Exposure bias analysis. This figure shows the trend of FVD scores during long-term video generation. Compared to the baseline, GF results in significantly lower FVD after 100 frames.
  • Figure 5: Qualitative comparisons on camera-conditioned video generation. All videos are generated given the first frame and per-frame camera poses. We compare GF (ours) with DFoT dfot, VideoREPA zhang2025videorepa, and REPA zhang2025videorepa. The results demonstrate consistency in long-term video generation in both indoor (left) and outdoor (right) scenes.