Growing with the Generator: Self-paced GRPO for Video Generation
Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, Xuelong Li
TL;DR
This work addresses the limitations of static reward models in reinforcement learning for video generation by introducing Self-Paced GRPO, a competence-aware framework in which the reward signal co-evolves with the generator. The method defines a latent, multi-term reward whose adaptive weights $w_j$ follow a continuous curriculum through three stages: visual fidelity, temporal coherence, and text–video alignment. The progression is driven by a competence function $c$ and a transition function $g_j$. Empirically, Self-Paced GRPO yields consistent improvements on VBench across backbones (Wan2.1-T2V and HunyuanVideo) and reward models (including large VLMs such as Qwen2.5-VL-72B), while reducing reward bias and instability relative to fixed-reward baselines. The results suggest that adaptive reward learning can provide stable, scalable reinforcement-based alignment for high-dimensional video generation, and they point to curriculum-based reward design and multi-objective RL as directions for future work in multimodal generation.
Abstract
Group Relative Policy Optimization (GRPO) has emerged as a powerful reinforcement learning paradigm for post-training video generation models. However, existing GRPO pipelines rely on static, fixed-capacity reward models whose evaluation behavior is frozen during training. Such rigid rewards introduce distributional bias, saturate quickly as the generator improves, and ultimately limit the stability and effectiveness of reinforcement-based alignment. We propose Self-Paced GRPO, a competence-aware GRPO framework in which reward feedback co-evolves with the generator. Our method introduces a progressive reward mechanism that automatically shifts its emphasis from coarse visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases. This self-paced curriculum alleviates reward-policy mismatch, mitigates reward exploitation, and yields more stable optimization. Experiments on VBench across multiple video generation backbones demonstrate consistent improvements in both visual quality and semantic alignment over GRPO baselines with static rewards, validating the effectiveness and generality of Self-Paced GRPO.
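To make the idea of a competence-driven reward concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a scalar competence value in [0, 1] and uses a hypothetical Gaussian-shaped transition (the `centers` and `sharpness` parameters are invented for illustration) to shift weight across the three reward terms. The resulting scalar rewards are then normalized within a group in the usual GRPO manner, by subtracting the group mean and dividing by the group standard deviation.

```python
import math
from statistics import mean, pstdev

# Order of reward terms: visual fidelity, temporal coherence, text-video alignment.
# ASSUMPTION: the centers and sharpness are illustrative, not taken from the paper.
CENTERS = (0.0, 0.5, 1.0)


def stage_weights(competence, centers=CENTERS, sharpness=8.0):
    """Return normalized weights w_j for a competence value in [0, 1].

    Each term peaks when competence is near its (hypothetical) center,
    so emphasis moves from fidelity to coherence to alignment.
    """
    scores = [math.exp(-sharpness * (competence - m) ** 2) for m in centers]
    total = sum(scores)
    return [s / total for s in scores]


def combined_reward(term_scores, competence):
    """Weighted sum of the three per-term scores for one video."""
    weights = stage_weights(competence)
    return sum(w * s for w, s in zip(weights, term_scores))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Made-up per-term scores for a group of four videos sampled from one prompt.
    group = [(0.9, 0.4, 0.2), (0.6, 0.7, 0.5), (0.5, 0.5, 0.9), (0.8, 0.6, 0.3)]
    for competence in (0.1, 0.5, 0.9):
        rewards = [combined_reward(scores, competence) for scores in group]
        print(competence, [round(a, 2) for a in group_relative_advantages(rewards)])
```

In this sketch the same group of videos is ranked differently as competence rises, which is the intended effect of a curriculum: early on, samples with strong visual fidelity receive the largest advantages, while later the samples with better text–video alignment do. How competence is actually estimated and how the transitions are parameterized are left to the paper.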
