Harness Local Rewards for Global Benefits: Effective Text-to-Video Generation Alignment with Patch-level Reward Models

Shuting Wang, Haihong Tang, Zhicheng Dou, Chenyan Xiong

TL;DR

HALO targets localized, patch-level defects in text-to-video generation. It builds a patch reward model by continuing to train a video reward model on patch labels distilled from GPT-4o, which keeps the patch and video reward distributions consistent. It then applies Gran-DPO, a granular variant of direct preference optimization (DPO) for diffusion models that uses patch and video rewards jointly to improve both local fidelity and global video quality. Results on VBench and VideoScore show that HALO outperforms the baselines and indicate that patch rewards provide guidance that is distinct from, and complementary to, global rewards. The patch reward model is also reported to align well with human annotations.

Abstract

The emergence of diffusion models (DMs) has significantly improved the quality of text-to-video generation models (VGMs). However, current VGM optimization primarily emphasizes the global quality of videos and overlooks localized errors, which leads to suboptimal generation capabilities. To address this issue, we propose HALO, a post-training strategy for VGMs that explicitly incorporates local feedback from a patch reward model; together with the video reward model, this provides detailed and comprehensive training signals for VGM optimization. To develop an effective patch reward model, we continue training our video reward model on labels distilled from GPT-4o, which improves training efficiency and ensures consistency between the video and patch reward distributions. Furthermore, to integrate patch rewards into VGM optimization, we introduce a granular DPO (Gran-DPO) algorithm for DMs, which allows patch and video rewards to be used collaboratively during optimization. Experimental results indicate that our patch reward model aligns well with human annotations and that HALO substantially outperforms the baselines under two evaluation methods. Further experiments quantitatively confirm the existence of patch defects and show that our proposed method can effectively alleviate this issue.
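
To make the idea of using patch and video rewards together more concrete, the following is a minimal sketch of how a video-level preference term and averaged patch-level preference terms could be combined in a DPO-style loss. It is an illustration under stated assumptions, not the paper's Gran-DPO objective, which the abstract does not spell out. The function names, the `patch_weight` mixing coefficient, and the representation of each preference pair as a (policy margin, reference margin) tuple are all hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_term(policy_margin: float, ref_margin: float, beta: float) -> float:
    """DPO-style loss for one preference pair.

    Each margin is score(preferred) - score(rejected), computed under the
    policy and under the frozen reference model respectively. How the
    score is obtained for a diffusion model is left abstract here.
    """
    return -log_sigmoid(beta * (policy_margin - ref_margin))


def granular_loss(video_margin, patch_margins, beta=1.0, patch_weight=0.5):
    """Hypothetical combination of video-level and patch-level terms.

    video_margin:  (policy_margin, ref_margin) for the whole video pair.
    patch_margins: list of (policy_margin, ref_margin), one per patch pair.
    patch_weight:  assumed mixing coefficient, not taken from the paper.
    """
    video_loss = dpo_term(video_margin[0], video_margin[1], beta)
    if not patch_margins:
        # No patch feedback available: fall back to the video term only.
        return video_loss
    patch_loss = sum(dpo_term(p, r, beta) for p, r in patch_margins)
    patch_loss /= len(patch_margins)
    return video_loss + patch_weight * patch_loss


if __name__ == "__main__":
    # Policy agrees with the video-level preference but gets one patch wrong.
    loss = granular_loss((0.8, 0.1), [(0.5, 0.0), (-0.4, 0.0)])
    print(f"combined loss: {loss:.4f}")
```

In this sketch, a pair that looks fine at the video level can still incur loss through a poorly ranked patch, which is the kind of local signal the abstract argues a global reward alone would miss.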

Paper Structure

This paper contains 32 sections, 10 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: A generated video with local flaws (red boxed).
  • Figure 2: The framework of our proposed method, HALO.
  • Figure 3: The visualized comparison between our proposed model HALO and baselines.
  • Figure 4: Further analyses about our reward models and training process.
  • Figure 5: Two types of patch reward distributions.
  • ...and 4 more figures