TeleBoost: A Systematic Alignment Framework for High-Fidelity, Controllable, and Robust Video Generation
Yuanzhi Liang, Xuan'er Wu, Yirui Liu, Yijie Fang, Yizhen Fan, Ke Hao, Rui Li, Ruiying Liu, Ziqi Ni, Peng Yu, Yanbo Wang, Haibin Huang, Qizhen Weng, Chi Zhang, Xuelong Li
TL;DR
TeleBoost reframes video post-training as a disciplined, multi-stage optimization pipeline that converts pretrained video generators into robust, controllable production models. The approach has three stages. Stage I uses supervised fine-tuning to establish a stable reference. Stage II applies GRPO-based reinforcement learning with ViPO, BPGO, and self-paced curricula to improve perceptual fidelity and temporal coherence under challenging, high-cost rollouts. Stage III uses Direct Preference Optimization (DPO) to capture holistic human judgments beyond explicit rewards. A dedicated reward-modeling architecture integrates semantic, temporal-physics, and safety signals, while infrastructure innovations (Ray-based parallelism and memory-efficient DPO via decoupled gradients) enable scalable, end-to-end training. The results demonstrate improved motion coherence, prompt adherence, and subject stability, with strong human-preference gains and qualitative demonstrations across diverse scenarios. TeleBoost offers a practical blueprint for deploying long-horizon, instruction-following video generation systems with reliable training feedback, stability, and generalization in real-world settings.
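As a rough illustration of the two objectives named in Stages II and III, the sketch below shows the textbook forms of a GRPO-style group-relative advantage and the DPO loss for a single preference pair. It is a minimal sketch under stated assumptions, not TeleBoost's implementation: the function names, the `eps` and `beta` defaults, and the scalar log-likelihood inputs are hypothetical, and how such likelihoods (or a surrogate for them) are obtained from a video generator is outside its scope. ViPO, BPGO, and the self-paced curricula are not represented.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt.

    Generic GRPO-style normalization: A_i = (r_i - mean) / (std + eps).
    `eps` is an assumed stabilizer for groups with identical rewards.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin compares the policy's
    log-ratio against the frozen reference for both samples.
    `beta` is an assumed temperature, not a value from the report.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage with made-up reward scores for four rollouts of one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
pair_loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

Normalizing within a group removes the need for a learned value baseline, which is one reason group-relative methods are attractive when each rollout is expensive.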
Abstract
Post-training is the decisive step for converting a pretrained video generator into a production-oriented model that is instruction-following, controllable, and robust over long temporal horizons. This report presents a systematic post-training framework that organizes supervised policy shaping, reward-driven reinforcement learning, and preference-based refinement into a single stability-constrained optimization stack. The framework is designed around practical video-generation constraints, including high rollout cost, temporally compounding failure modes, and feedback that is heterogeneous, uncertain, and often weakly discriminative. By treating optimization as a staged, diagnostic-driven process rather than a collection of isolated tricks, the report summarizes a cohesive recipe for improving perceptual fidelity, temporal coherence, and prompt adherence while preserving the controllability established at initialization. The resulting framework provides a clear blueprint for building scalable post-training pipelines that remain stable, extensible, and effective in real-world deployment settings.
