VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

Tengjiao Yin, Jinglei Shi, Heng Guo, Xi Wang

Abstract

Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach computes the error pointwise, yielding a more physically grounded and robust metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via supervised fine-tuning (SFT) or reinforcement learning, and inference-time optimization of a causal video model (e.g., a streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward is more robust than other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.
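
To make the idea of a cross-frame pointwise reprojection error concrete, the following is a minimal sketch and not the paper's formulation. It assumes a pinhole camera, a world-to-camera pose for the target frame, 3D points predicted for a source frame, and known corresponding pixel locations in the target frame. The function names `project` and `reprojection_reward`, and the choice of the negative mean pixel distance as the reward, are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project(point_world: Vec3, rotation: Sequence[Sequence[float]],
            translation: Vec3, fx: float, fy: float,
            cx: float, cy: float) -> Pixel:
    """Project a world-space point with a pinhole camera (hypothetical helper)."""
    # Camera-space coordinates: X_cam = R @ X_world + t (world-to-camera pose).
    xc, yc, zc = (
        sum(rotation[r][k] * point_world[k] for k in range(3)) + translation[r]
        for r in range(3)
    )
    if zc <= 0:
        raise ValueError("point lies behind the camera")
    # Perspective division followed by the intrinsics.
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def reprojection_reward(points_world: Sequence[Vec3],
                        observed_pixels: Sequence[Pixel],
                        rotation: Sequence[Sequence[float]],
                        translation: Vec3, fx: float, fy: float,
                        cx: float, cy: float) -> float:
    """Negative mean distance between projected and observed points.

    Assumption: the reward is the negated mean error, so a more
    geometrically consistent frame pair scores higher.
    """
    if not points_world or len(points_world) != len(observed_pixels):
        raise ValueError("need equally many points and observations")
    errors = []
    for point, (u_obs, v_obs) in zip(points_world, observed_pixels):
        u, v = project(point, rotation, translation, fx, fy, cx, cy)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return -sum(errors) / len(errors)
```

The error here depends only on point positions, never on pixel intensities, which is the property the abstract contrasts with pixel-space metrics. In practice the points, poses, and correspondences would come from a pretrained geometric foundation model rather than being supplied by hand.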

Paper Structure (37 sections, 11 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Examples of geometry artifacts in generated videos and our results.
  • Figure 2: Overview of our framework, consisting of two components: (a) Geometry-Based Reward Model: a geometry-aware sampling (GAS) module leverages the global attention of VGGT to identify salient patches, and a reward module computes the cross-frame pointwise reprojection error; (b) Geometric Preference Alignment: the model is aligned via SFT [lee2023alignt2i] and DPO [liu2025improvevidgen] on a bidirectional model, or via test-time scaling (TTS) with our reward as a path verifier on a causal model [zhu2026causalforcing].
  • Figure 3: Visualization of Geometry-Aware Sampling, showing that the global attention of VGGT naturally captures the background geometry. We select the top-$\tau$ percent of attention-emphasized patches and sample at the center of each patch.
  • Figure 4: Budget Evaluation of TTS. All three methods (SoS, SoP, and Beam Search) show scaling tendencies as the search budget increases. Beam Search leads on 3D metrics, while SoP achieves the best overall performance.
  • Figure 5: Qualitative Results of Test-Time Scaling. The baseline exhibits geometric artifacts (highlighted by $\times$), such as the incorrect perspective relations in the last two frames. Our approach, whether optimizing over the initial seed (SoS), selecting frame-by-frame along the temporal axis (SoP), or applying beam search (BS), consistently produces geometrically coherent videos with no visible artifacts (highlighted by ✓).
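
The patch selection described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes a per-patch attention score grid has already been extracted from the geometry model, treats $\tau$ as a fraction in (0, 1], and uses a hypothetical function name `sample_patch_centers`. The patch size is a free parameter here.

```python
import math
from typing import List, Sequence, Tuple


def sample_patch_centers(attention: Sequence[Sequence[float]],
                         patch_size: int,
                         tau: float) -> List[Tuple[float, float]]:
    """Return (x, y) pixel centers of the top-tau fraction of patches.

    `attention[row][col]` is an assumed per-patch attention score;
    how it is aggregated from the model is outside this sketch.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    # Flatten the grid while remembering each patch's position.
    scored = [
        (score, row, col)
        for row, scores in enumerate(attention)
        for col, score in enumerate(scores)
    ]
    if not scored:
        raise ValueError("attention grid is empty")
    # Keep at least one patch, rounding the budget up.
    keep = max(1, math.ceil(tau * len(scored)))
    top = sorted(scored, key=lambda item: item[0], reverse=True)[:keep]
    # Sample at the center of each selected patch, in pixel coordinates.
    return [
        ((col + 0.5) * patch_size, (row + 0.5) * patch_size)
        for _, row, col in top
    ]
```

For example, on a 2×2 grid with `tau=0.5`, the two highest-scoring patches are kept and one sample point is returned per patch. The returned centers would then serve as the locations at which a pointwise reprojection error is evaluated.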