Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian
TL;DR
Geometry Forcing (GF) addresses the gap between 2D video diffusion and 3D structure by aligning the diffusion model's intermediate representations with geometry-aware features from a pretrained 3D foundation model (VGGT), using Angular and Scale Alignment losses. This yields 3D-aware latent representations, enabling more consistent long-term video generation and, potentially, explicit 3D reconstruction at inference time. Experiments on RealEstate10K and Minecraft show improved visual quality (lower FVD) and 3D consistency (better RPE/RVE) compared to baselines, with ablations highlighting the contribution of VGGT supervision, the two alignment losses, and mid-level feature alignment. The work suggests a practical path to integrating 3D priors into video synthesis for more robust, memory-enabled world modeling, while noting scalability considerations for larger datasets and architectures.
Abstract
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over baseline methods. Project page: https://GeometryForcing.github.io.
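
The two alignment objectives described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the diffusion features have already been projected to the same dimension as the geometric features, and the linear head (`W`, `b`), the function names, and the loss weights in the usage lines are hypothetical choices made for illustration only.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each feature vector (last axis) to approximately unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def angular_alignment_loss(h, y):
    """Mean of (1 - cosine similarity) between diffusion features h and
    geometric target features y, both of shape (N, D).

    Only the direction of each vector matters here, not its magnitude.
    """
    cos = np.sum(l2_normalize(h) * l2_normalize(y), axis=-1)
    return float(np.mean(1.0 - cos))


def scale_alignment_loss(h, y, W, b):
    """Mean squared error between the unnormalized targets y and a linear
    prediction made from the *normalized* diffusion features.

    W (D, D) and b (D,) form a hypothetical regression head; the source
    does not specify the form of the predictor.
    """
    pred = l2_normalize(h) @ W + b
    return float(np.mean((pred - y) ** 2))


# Usage with random stand-in features (assumed shapes: 16 tokens, 8 dims).
rng = np.random.default_rng(0)
h = rng.normal(size=(16, 8))   # stand-in for projected diffusion features
y = rng.normal(size=(16, 8))   # stand-in for geometric foundation-model features
W = 0.1 * rng.normal(size=(8, 8))
b = np.zeros(8)

# Hypothetical weights for combining the two terms.
lambda_angular, lambda_scale = 1.0, 1.0
total = (lambda_angular * angular_alignment_loss(h, y)
         + lambda_scale * scale_alignment_loss(h, y, W, b))
```

The split reflects the complementary roles named in the abstract: the angular term is invariant to feature magnitude, so the scale term is what forces magnitude information to be recoverable from the normalized representation.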
