VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, Marta Tintore Gazulla

Abstract

Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or by applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing the LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement learning an efficient and flexible approach to world-consistent video generation.
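As a concrete illustration of the two rewards and the group-relative update they feed, the sketch below shows one plausible PyTorch formulation. Everything here, the `(T, 4, 4)` pose and `(T, H, W, 3)` point-map shapes, the second-difference smoothness term, and the frame-to-frame reprojection term, is our assumption for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def camera_smoothness_reward(poses: torch.Tensor) -> torch.Tensor:
    """poses: (T, 4, 4) camera-to-world matrices decoded by the LGM.
    Penalizes jitter via second-order differences of the camera centers,
    a standard smoothness proxy (the paper's exact term may differ)."""
    centers = poses[:, :3, 3]                                    # (T, 3)
    accel = centers[2:] - 2.0 * centers[1:-1] + centers[:-2]     # 2nd difference
    return -accel.norm(dim=-1).mean()                            # higher = smoother

def reprojection_consistency_reward(pointmaps: torch.Tensor,
                                    poses: torch.Tensor,
                                    K: torch.Tensor) -> torch.Tensor:
    """pointmaps: (T, H, W, 3) world-space point maps; poses: (T, 4, 4)
    camera-to-world; K: (3, 3) intrinsics. Projects frame t's points into
    frame t+1 and penalizes disagreement with frame t+1's own point map.
    This is a static-region proxy; dynamic content would additionally
    need a motion mask, omitted here for brevity."""
    T, H, W, _ = pointmaps.shape
    per_pair = []
    for t in range(T - 1):
        pts = pointmaps[t].reshape(-1, 3)                        # (HW, 3)
        w2c = torch.linalg.inv(poses[t + 1])                     # world -> cam t+1
        cam = pts @ w2c[:3, :3].T + w2c[:3, 3]
        uv = cam @ K.T                                           # pinhole projection
        uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,          # map to [-1, 1]
                            uv[:, 1] / (H - 1) * 2 - 1],
                           dim=-1).view(1, H, W, 2)
        target = pointmaps[t + 1].permute(2, 0, 1).unsqueeze(0)  # (1, 3, H, W)
        sampled = F.grid_sample(target, grid, align_corners=True)
        err = sampled[0].permute(1, 2, 0).reshape(-1, 3) - pts
        in_front = cam[:, 2] > 1e-6                              # in front of camera
        in_view = (grid.view(-1, 2).abs() <= 1).all(dim=-1)      # inside the image
        valid = in_front & in_view
        per_pair.append(-err[valid].norm(dim=-1).mean())
    return torch.stack(per_pair).mean()

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO advantage: each rollout's reward normalized within its group
    of rollouts generated from the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

Because both rewards are computed from the LGM's latent-space predictions, no VAE decode is needed inside the reward loop, which is the efficiency argument the abstract makes.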

Paper Structure

This paper contains 27 sections, 32 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: World-consistent Video Generations with VGGRPO. We compare the baseline video diffusion model (left, orange) with the VGGRPO-aligned model (right, green). Each example depicts a challenging dynamic scene; we visualize representative keyframes from the generated video and reconstructed scene geometry from the inferred 4D scene representation. VGGRPO produces markedly more coherent scene structure and smoother camera motion over time, reducing geometric drift and structural artifacts in challenging dynamic settings.
  • Figure 2: Method Overview. (a) Latent Geometry Model. We connect latents from the diffusion VAE encoder to a geometry foundation model via a lightweight connector, yielding a Latent Geometry Model that predicts 4D scene geometry directly from video latents. (b) VGGRPO training. We perform latent-space GRPO using two complementary rewards, camera motion smoothness and geometry reprojection consistency, computed entirely in latent space with the latent geometry model. Together, these components align the video diffusion model toward 4D world-consistent generation on both static and dynamic scenes.
  • Figure 3: Qualitative Comparison on Static and Dynamic Scenes. We show the first, middle, and last frames of video generations for a static-scene prompt (left) and a dynamic-scene prompt (right), with a representative segment of each prompt shown at the top. All baselines exhibit inconsistency artifacts, including geometric drift, temporal flicker, and unstable camera motion. In contrast, VGGRPO produces more coherent scene structure with smoother camera trajectories across frames.
  • Figure 4: Reward Components Ablation. The reconstructed scene visualizes the estimated camera trajectory (red curve), with the first and last frames shown below each reconstruction. Compared to the Baseline, optimizing the motion reward $r_{\mathrm{motion}}$ stabilizes camera motion, but geometric artifacts persist (green circle). Adding the reprojection consistency reward ($r_{\mathrm{motion}}{+}r_{\mathrm{geo}}$) further improves scene geometry while preserving smooth camera motion, demonstrating that the two reward components are complementary.
  • Figure 5: Analysis of the Latent Geometry Model. We compare the latent geometry model with the original RGB-based geometry model on 50 RealEstate10K test sequences under controlled perturbations applied in the video latent space. The top row reports camera pose estimation performance as the perturbation scale $\alpha$ increases, measured by relative rotation accuracy (left, Racc@5), area under the accuracy curve (middle, AUC@5), and relative translation accuracy (right, Tacc@5). Our latent geometry model maintains stable performance across all noise levels, whereas the RGB-based geometry model degrades substantially as the perturbation grows. The bottom row shows a decoded RGB frame from the perturbed latents at different values of $\alpha$. Even when perturbations produce only barely perceptible visual changes in RGB space, the RGB-based geometry model already degrades, reflecting the distribution gap when applied to generated content rather than real images. Our latent geometry model, trained directly on generated latents, avoids this gap and remains robust.
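The robustness analysis in Figure 5 reduces to a simple sweep: add $\alpha$-scaled Gaussian noise to the video latents, estimate camera poses once directly from the latents and once from the VAE-decoded frames, and score each against ground truth. Below is a minimal sketch, assuming hypothetical `lgm`, `vae.decode`, and `rgb_geometry_model` interfaces and one common definition of Racc@5 (fraction of relative rotations within 5 degrees of ground truth); the $\alpha$ grid is illustrative, not the paper's.

```python
import torch

@torch.no_grad()
def perturbation_sweep(latents, lgm, vae, rgb_geometry_model, gt_rel_rots,
                       alphas=(0.0, 0.25, 0.5, 1.0)):
    """Perturb video latents with alpha-scaled Gaussian noise, then compare
    pose estimates from the latent geometry model (run on latents) against
    an RGB geometry model (run on VAE-decoded frames). All interfaces here
    are assumptions standing in for the paper's actual components."""
    noise = torch.randn_like(latents)
    results = {}
    for alpha in alphas:
        z = latents + alpha * noise
        poses_latent = lgm(z)                  # geometry directly from latents
        frames = vae.decode(z)                 # the costly step the LGM avoids
        poses_rgb = rgb_geometry_model(frames)
        results[alpha] = {"latent": racc_at_5(poses_latent, gt_rel_rots),
                          "rgb": racc_at_5(poses_rgb, gt_rel_rots)}
    return results

def racc_at_5(poses, gt_rel_rots):
    """Relative rotation accuracy @ 5 degrees between adjacent frames.
    poses: (T, 4, 4) camera-to-world; gt_rel_rots: (T-1, 3, 3). This is one
    common definition; the paper may evaluate over all frame pairs."""
    R = poses[:, :3, :3]
    rel = R[1:] @ R[:-1].transpose(-1, -2)     # predicted relative rotations
    diff = rel @ gt_rel_rots.transpose(-1, -2)
    trace = diff.diagonal(dim1=-2, dim2=-1).sum(-1)
    # Rotation angle from the trace: theta = arccos((tr(R) - 1) / 2).
    theta = torch.rad2deg(torch.arccos(((trace - 1) / 2).clamp(-1.0, 1.0)))
    return (theta < 5.0).float().mean().item()
```

Under this protocol, the figure's finding is that the latent path stays flat across $\alpha$ while the decode-then-RGB path degrades, consistent with the claim that training the geometry model directly on generated latents closes the distribution gap.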