Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos
Zhiyu Tan, Junyan Wang, Hao Yang, Luozheng Qin, Hesen Chen, Qiang Zhou, Hao Li
TL;DR
This work tackles core bottlenecks in text-to-video generation by coupling a high-quality, coarse-to-fine curated dataset (CFC-VIDS-1M) with a transformer-based diffusion model (RACCOON) that employs decoupled spatial-temporal attention. A progressive four-stage training pipeline moves from semantic grounding to temporal dynamics, then to high-resolution synthesis and final aesthetic refinement, enabling efficient generation of photorealistic, temporally coherent videos. Results on UCF-101, together with human evaluations and VBench benchmarks, show better fidelity, text-video alignment, and motion realism than prior methods while maintaining computational efficiency. The dataset and model releases, along with ablations and ethical considerations, offer a practical, scalable path for responsible video synthesis research and applications.
Abstract
Text-to-video generation has made promising progress with the advent of diffusion models, yet existing approaches are limited by dataset quality and computational resources. To address these limitations, this paper presents a comprehensive approach that advances both data curation and model design. We introduce CFC-VIDS-1M, a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. The pipeline first evaluates video quality across multiple dimensions, then applies a fine-grained stage that leverages vision-language models to enhance text-video alignment and semantic richness. Building upon the curated dataset's emphasis on visual quality and temporal coherence, we develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms. The model is trained through a progressive four-stage strategy designed to efficiently handle the complexities of video generation. Extensive experiments demonstrate that combining high-quality data curation with an efficient training strategy yields visually appealing and temporally coherent videos while maintaining computational efficiency. We will release our dataset, code, and models.
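As a rough illustration of why decoupled spatial-temporal attention can help with computational efficiency, the sketch below counts query-key pairs for joint attention over a whole clip versus a factorized scheme. It assumes the common factorization in which a spatial pass attends within each frame and a temporal pass attends across frames at each spatial location; the abstract does not specify RACCOON's exact design, and the function names and the 16 x 32 x 32 token grid are hypothetical.

```python
def attention_pairs_full(frames: int, height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    tokens = frames * height * width
    return tokens * tokens


def attention_pairs_decoupled(frames: int, height: int, width: int) -> int:
    """Query-key pairs for one spatial pass plus one temporal pass (assumed factorization)."""
    spatial_tokens = height * width
    # Spatial pass: tokens attend only to tokens in the same frame.
    spatial = frames * spatial_tokens * spatial_tokens
    # Temporal pass: each spatial location attends across all frames.
    temporal = spatial_tokens * frames * frames
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical latent grid, not taken from the paper.
    t, h, w = 16, 32, 32
    full = attention_pairs_full(t, h, w)
    decoupled = attention_pairs_decoupled(t, h, w)
    print(f"full: {full}, decoupled: {decoupled}, ratio: {full / decoupled:.2f}")
```

Under this assumption, with T frames and S spatial tokens per frame, the pair count falls by a factor of T*S / (T + S), which approaches T when S is much larger than T.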
