TeleBoost: A Systematic Alignment Framework for High-Fidelity, Controllable, and Robust Video Generation

Yuanzhi Liang, Xuan'er Wu, Yirui Liu, Yijie Fang, Yizhen Fan, Ke Hao, Rui Li, Ruiying Liu, Ziqi Ni, Peng Yu, Yanbo Wang, Haibin Huang, Qizhen Weng, Chi Zhang, Xuelong Li

TL;DR

TeleBoost reframes video post-training as a disciplined, multi-stage optimization pipeline that converts pretrained video generators into robust, controllable production models. The approach combines three stages: Stage I supervised fine-tuning to establish a stable reference; Stage II GRPO-based reinforcement learning with ViPO, BPGO, and self-paced curricula to improve perceptual fidelity and temporal coherence under challenging, high-cost rollouts; and Stage III Direct Preference Optimization (DPO) to capture holistic human judgments beyond explicit rewards. A dedicated reward-modeling architecture integrates semantic, temporal-physics, and safety signals, while infrastructure innovations (Ray-based parallelism and memory-efficient DPO via decoupled gradients) enable scalable, end-to-end training. The results demonstrate improved motion coherence, prompt adherence, and subject stability, with strong human-preference gains and qualitative demonstrations across diverse scenarios. TeleBoost offers a practical blueprint for deploying long-horizon, instruction-following video generation systems with reliable training feedback, stability, and generalization in real-world settings.

Abstract

Post-training is the decisive step for converting a pretrained video generator into a production-oriented model that is instruction-following, controllable, and robust over long temporal horizons. This report presents a systematic post-training framework that organizes supervised policy shaping, reward-driven reinforcement learning, and preference-based refinement into a single stability-constrained optimization stack. The framework is designed around practical video-generation constraints, including high rollout cost, temporally compounding failure modes, and feedback that is heterogeneous, uncertain, and often weakly discriminative. By treating optimization as a staged, diagnostic-driven process rather than a collection of isolated tricks, the report summarizes a cohesive recipe for improving perceptual fidelity, temporal coherence, and prompt adherence while preserving the controllability established at initialization. The resulting framework provides a clear blueprint for building scalable post-training pipelines that remain stable, extensible, and effective in real-world deployment settings.

Paper Structure (90 sections, 12 equations, 17 figures, 9 tables)

This paper contains 90 sections, 12 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Systematic overview of our video post-training framework. Starting from a pretrained video diffusion backbone, the pipeline proceeds through three staged optimizations: (I) supervised adaptation to establish a stable and controllable policy, (II) automatic-feedback optimization to improve alignment, perceptual quality, and temporal coherence under stability constraints, and (III) human-preference alignment to capture holistic judgments that are difficult to encode as explicit rewards. Evaluation and diagnostics act as cross-stage components, enabling slice-based analysis and reproducible validation.
  • Figure 2: Overview of Stage I supervised fine-tuning (SFT). Starting from a pretrained video diffusion backbone with a frozen encoder and diffusion transformer, SFT shapes the generator through a unified training objective that integrates instruction and control supervision, spatial-structure–aware constraints, and physics-aware motion supervision. All structural priors are injected conservatively at the decoder level, producing a stable and structurally constrained reference policy for subsequent post-training stages.
  • Figure 3: Stage II GRPO optimization stack. Starting from the SFT-initialized policy, Stage II follows a closed-loop sample $\rightarrow$ evaluate $\rightarrow$ update pipeline. For each prompt group, the current policy generates multiple video rollouts, which are scored by a set of evaluators (e.g., visual quality, motion/temporal coherence, text alignment, safety). GRPO performs group-relative normalization to convert raw feedback into stable learning signals. On top of this backbone, we introduce modular refinements: structural credit assignment (ViPO) lifts scalar feedback into spatiotemporal advantage maps for fine-grained credit assignment; prior-guided reward transform (BPGO) calibrates the trust of noisy/ambiguous supervision via prior-referenced uncertainty; and self-paced curriculum (Self-Paced GRPO) constructs an adaptive reward curriculum to mitigate reward saturation and late-stage stagnation. A multi-objective balance (Joint Reward) layer reconciles multi-objective trade-offs and produces the final optimization signal used to update the policy while maintaining stability.
  • Figure 4: ViPO: a perceptual structuring module builds spatial/temporal allocation maps used to convert scalar GRPO advantages into pixel/latent-level advantages.
  • Figure 5: BPGO: Reliability-Adaptive Scaling (RAS) reweights prompt groups based on deviation from prior rewards; Contrastive Reward Transformation (CRT) sharpens within-group discrimination.
  • ...and 12 more figures
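
The captions for Figures 3 and 4 describe two mechanisms that are easier to see in a toy example: group-relative normalization of rollout rewards, and converting a scalar advantage into a dense advantage map. The sketch below is illustrative only and is not the authors' implementation. The function names, the use of the population standard deviation, the `eps` constant, and the mean-one normalization of the allocation map are all assumptions made for this example. The captions do not specify how ViPO builds or normalizes its maps.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize the rewards of one prompt group (GRPO-style).

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation. `eps` (an assumed
    constant) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def spread_advantage(advantage, allocation_map):
    """Distribute one scalar advantage over a 2D allocation map.

    Hypothetical stand-in for the ViPO idea in Figure 4: the map is
    rescaled to have mean 1, so the dense advantages average back to
    the original scalar. The real module's map construction is not
    described in the captions.
    """
    flat = [w for row in allocation_map for w in row]
    avg = mean(flat)
    if avg <= 0:
        raise ValueError("allocation map must have positive mean")
    return [[advantage * w / avg for w in row] for row in allocation_map]


# Four rollouts for one prompt, scored by some evaluator (made-up values).
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Spread the best rollout's advantage over a toy 2x2 allocation map.
dense = spread_advantage(advantages[2], [[1.0, 3.0], [0.0, 0.0]])
```

Because the advantages are centered within the group, they sum to approximately zero: rollouts are rewarded only relative to their siblings for the same prompt. The mean-one rescaling in `spread_advantage` is one simple way to keep the total learning signal unchanged while concentrating it on selected regions.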