GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling

Shivanshu Shekhar, Uttaran Bhattacharya, Raghavendra Addanki, Mehrab Tanjim, Somdeb Sarkhel, Tong Zhang

TL;DR

GT-SVJ tackles temporally aware reward modeling for video generation by repurposing a strong video generator as a temporally grounded reward model. It builds a two-stage framework: a discriminative model trained via self-supervised energy-based contrastive learning with real, generated, and perturbation-based negatives, and a reward model derived from the discriminative features using aspect-wise predictions and a Bradley–Terry style ranking loss. The approach achieves state-of-the-art alignment on GenAI-Bench and MonteBench with only about 30K human annotations, demonstrating strong data efficiency and robust temporal sensitivity, while remaining competitive on VideoReward-Bench. This work suggests that leveraging temporally aware generative representations and carefully crafted hard negatives can yield more stable, granular video reward signals for human preference alignment.
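
The Bradley–Terry style ranking loss mentioned above can be illustrated with a minimal sketch. This summary does not give the paper's exact formulation, so the code below uses the textbook Bradley–Terry form, where the probability that video A is preferred over video B is the sigmoid of the reward difference. The function names and the scalar rewards are illustrative assumptions, not the authors' implementation.

```python
import math


def bradley_terry_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the preferred video wins under Bradley-Terry.

    Equals -log(sigmoid(reward_preferred - reward_rejected)), computed in a
    numerically stable way. The rewards are hypothetical scalar scores.
    """
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(reward_a: float, reward_b: float) -> float:
    """P(A preferred over B) = sigmoid(reward_a - reward_b)."""
    return math.exp(-bradley_terry_loss(reward_a, reward_b))


# Toy check: the loss is small when the human-preferred video gets the higher reward.
loss_agree = bradley_terry_loss(2.0, 0.0)
loss_disagree = bradley_terry_loss(0.0, 2.0)
```

Only the difference between the two rewards enters the loss, so a ranking objective of this kind constrains relative preferences rather than absolute score values.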

Abstract

Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: $6\times$ to $65\times$ fewer than existing VLM-based approaches.
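
To make the three latent-space perturbations more concrete, here is a minimal sketch of one possible reading of each. The abstract names the perturbations but does not define them, so the exact operations, the function names, and the toy latent layout (a list of per-frame feature vectors) are all assumptions for illustration and may differ from the authors' procedure.

```python
import random
from typing import List

# Hypothetical layout: latent[t] is the feature vector of latent frame t.
Latent = List[List[float]]


def frame_shuffle(latent: Latent, rng: random.Random) -> Latent:
    """Permute the temporal order of frames; per-frame content is unchanged."""
    order = list(range(len(latent)))
    while len(order) > 1 and order == sorted(order):
        rng.shuffle(order)  # retry until the order actually changes
    return [list(latent[t]) for t in order]


def temporal_slice(latent: Latent, start: int, length: int) -> Latent:
    """Assumed reading: cut frames [start, start + length) and repeat the
    last kept frame so the clip keeps its original number of frames."""
    kept = latent[:start] + latent[start + length:]
    if not kept:
        raise ValueError("slice removes every frame")
    padded = kept + [kept[-1]] * (len(latent) - len(kept))
    return [list(frame) for frame in padded]


def feature_swap(latent: Latent, t_a: int, t_b: int, channels: List[int]) -> Latent:
    """Assumed reading: exchange the chosen feature channels between two frames."""
    out = [list(frame) for frame in latent]
    for c in channels:
        out[t_a][c], out[t_b][c] = out[t_b][c], out[t_a][c]
    return out


rng = random.Random(0)
clip = [[float(t), float(t) + 0.5] for t in range(6)]  # 6 toy frames, 2 features each
negatives = {
    "shuffle": frame_shuffle(clip, rng),
    "slice": temporal_slice(clip, start=2, length=2),
    "swap": feature_swap(clip, t_a=0, t_b=5, channels=[1]),
}
```

Each function returns a new latent and leaves the input untouched, so a single real clip can yield several hard negatives that differ from it only in temporal order or in a few features.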

Paper Structure

This paper contains 17 sections, 12 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: GT-SVJ in action. Given two videos, our self-supervised model evaluates and ranks them on preferences, outperforming baselines on human preference alignment.
  • Figure 2: Overview of the proposed GT-SVJ framework. The framework consists of two stages: (top) training a discriminative model, where the video generative model (CogVideoX) is adapted using a contrastive energy-based objective with real, generated, and perturbed videos, and (middle and bottom) training a reward model, where the discriminative model (DM) is aligned with human ratings through aspect-wise prediction (AWP) via regression (middle) followed by relative preference modeling (bottom).
  • Figure 3: Illustration of energy trajectories predicted by our energy-based model. For the real video in (a), the energy trajectory across time steps is smooth and stable, indicating consistent temporal dynamics. In contrast, for the generated videos in (b) and (c), the energy values fluctuate erratically, reflecting spatial and temporal inconsistencies such as implausible scene lighting and motion.
  • Figure 4: Effect of LoRA placement within the backbone transformer. We compare applying LoRA to the initial third, middle third, and last third of the transformer layers. The middle-layer configuration achieves the best overall performance, while the last-layer configuration provides faster training with minimal loss in accuracy.
  • Figure 5: Effect of the discriminative model. Initializing the reward model with the trained discriminative model leads to lower validation losses and higher validation accuracies throughout training.
  • ...and 3 more figures
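
The contrastive energy-based objective named in the Figure 2 caption can be sketched as follows. The exact loss used in the paper is not given in this summary, so the code shows one generic pairwise logistic form that pushes the energy of a real video below the energies of its generated or perturbed negatives. The function names and the scalar energies are hypothetical.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def contrastive_energy_loss(energy_real: float, energies_negative: Sequence[float]) -> float:
    """Mean of softplus(E_real - E_negative) over all negatives.

    The loss is small when the real clip has lower energy than every
    negative, matching the convention that low energy means high quality.
    """
    if not energies_negative:
        raise ValueError("at least one negative energy is required")
    total = sum(softplus(energy_real - e) for e in energies_negative)
    return total / len(energies_negative)


# Toy check: a well-separated model (real low, negatives high) gets a smaller loss.
loss_separated = contrastive_energy_loss(-2.0, [2.0, 3.0])
loss_inverted = contrastive_energy_loss(2.0, [-2.0, -3.0])
```

Under this form, a negative whose energy is already far above the real clip's contributes almost nothing to the loss, so training effort concentrates on the hard negatives that are still close in energy.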