
Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos

Weijia Dou, Wenzhao Zheng, Weiliang Chen, Yu Zheng, Jie Zhou, Jiwen Lu

Abstract

Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D Spatial Geometric Consistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each sub-region, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generative videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
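
The core idea of comparing camera poses estimated from different static sub-regions can be illustrated with a minimal sketch. This is not the paper's scoring formula, which is not given in this excerpt: the function names (`rotation_angle`, `pose_divergence`, `rot_z`) and the choice of mean pairwise angles as the divergence measure are assumptions made purely for illustration. The sketch assumes each sub-region already has a relative pose (a 3x3 rotation and a nonzero translation) between two frames.

```python
import numpy as np


def rot_z(theta):
    """Hypothetical helper: rotation by `theta` radians about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rotation_angle(R_a, R_b):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    R_rel = R_a.T @ R_b
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


def direction_angle(t_a, t_b):
    """Angle (radians) between two nonzero translation vectors."""
    u_a = np.asarray(t_a, dtype=float) / np.linalg.norm(t_a)
    u_b = np.asarray(t_b, dtype=float) / np.linalg.norm(t_b)
    return float(np.arccos(np.clip(np.dot(u_a, u_b), -1.0, 1.0)))


def pose_divergence(rotations, translations):
    """Mean pairwise rotation angle and translation-direction angle.

    Assumed illustrative measure: a rigid static scene should yield
    (nearly) identical local poses, so both values should be close to 0.
    """
    n = len(rotations)
    if n < 2:
        return 0.0, 0.0
    rot_angles, dir_angles = [], []
    for i in range(n):
        for j in range(i + 1, n):
            rot_angles.append(rotation_angle(rotations[i], rotations[j]))
            dir_angles.append(direction_angle(translations[i], translations[j]))
    return float(np.mean(rot_angles)), float(np.mean(dir_angles))


# Consistent scene: every sub-region reports the same relative pose.
consistent = pose_divergence(
    [rot_z(0.1), rot_z(0.1), rot_z(0.1)],
    [[1.0, 0.0, 0.0]] * 3,
)

# Inconsistent scene: two sub-regions disagree on rotation and direction.
inconsistent = pose_divergence(
    [rot_z(0.0), rot_z(0.2)],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
)
```

Under these assumptions, a geometrically coherent background gives divergences near zero, while warping or "sticking" artifacts would show up as disagreement between sub-region poses. Foreground motion does not enter the measure because only static sub-regions contribute poses.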

Paper Structure

This paper contains 27 sections, 8 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Examples of 3D Spatial Geometric Inconsistencies in Generated Videos. Existing models often fail to maintain geometric consistency, exhibiting critical 3D spatial failures despite plausible per-frame visuals. (a) Geometric Warping: The rigid structure of the static buildings severely distorts as the camera moves. (b) Incoherent Motion: The static workbench illogically "sticks" to and moves with the dynamic piece of wood, violating physical separation. (c) Object Impermanence: A static structure on the mountain "flickers" and illogically changes its shape, failing to persist over time. (d) Perspective Failure: The distant mountains, which should remain stable, unnaturally warp and "narrow" as the skier moves forward, violating 3D perspective.
  • Figure 2: Overview of the SGC computation pipeline. Input RGB frames undergo parallel processing: (i) depth estimation, leading to dense point reconstruction for global camera pose estimation; and (ii) pixel tracking followed by motion segmentation to isolate moving objects. The identified static background is then adaptively segmented. Local camera poses for these static sub-regions are subsequently estimated using information from pixel tracks and depth. Finally, the overall SGC score is computed by aggregating three key evaluations: local inter-segment consistency, global pose consistency, and cross-frame depth consistency.
  • Figure 3: Visualization of the Local Inter-Segment and Global Pose Consistency. Blue arrows depict image-plane projections of motion directions induced by local relative poses $P_{curr\_prev}^{local,j}$ of static sub-regions, while the red arrow denotes the projected global relative pose $P_{curr\_prev}^{VGGT}$. Arrows are obtained by projecting a unit 3D direction vector under each relative transformation, length-normalized for visualization. Sequences from left to right exhibit decreasing consistency scores. (Left) High consistency: local poses are similar and align with global motion. (Right) Low consistency: local poses show high variance and/or deviate significantly from global motion.
  • Figure 4: Qualitative Validation: SGC Detects Geometric Failures Missed by Feature Metrics. Latte (R1) scores poorly on SGC because of object instability, despite a high VBench-BC score driven by plausible textures. VideoCrafter (R2) is consistent and scores well on both. Seine exhibits catastrophic geometric breakdown (e.g., subject distortion), reflected in its very high SGC score. RT-1, despite high dynamics (robot motion), scores excellently, demonstrating SGC's robustness to foreground motion.
  • Figure 5: Impact of Geometric Perturbations on SGC Score. This line plot presents a sensitivity and monotonicity analysis, showing SGC scores rising with geometric degradation severity (from 0.0 to 1.0) on the nuScenes dataset. Baseline (unperturbed) SGC scores for the best-performing (Cosmos, 0.0722) and worst-performing (Latte, 0.3226) generative models on the GenWorld dataset are included as dashed teal and red reference lines, respectively. This controlled test confirms that high SGC scores directly measure geometric failure rather than semantic or texture shifts.
  • ...and 5 more figures