Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos

Zhiyu Tan, Junyan Wang, Hao Yang, Luozheng Qin, Hesen Chen, Qiang Zhou, Hao Li

TL;DR

This work tackles two bottlenecks in text-to-video generation, dataset quality and computational cost, by coupling a coarse-to-fine curated dataset (CFC-VIDS-1M) with a transformer-based diffusion model (RACCOON) that employs decoupled spatial-temporal attention. A progressive four-stage training pipeline covers semantic grounding, temporal dynamics, high-resolution synthesis, and final aesthetic refinement, enabling efficient generation of photorealistic, temporally coherent videos. Empirical results on UCF-101, supplemented by human judgments and VBench benchmarks, show better fidelity, text-video alignment, and motion realism than prior methods while maintaining computational efficiency. The planned release of the dataset, code, and models, together with ablations and ethical considerations, offers a practical, scalable pathway for responsible video synthesis research and applications.

Abstract

Text-to-video generation has demonstrated promising progress with the advent of diffusion models, yet existing approaches are limited by dataset quality and computational resources. To address these limitations, this paper presents a comprehensive approach that advances both data curation and model design. We introduce CFC-VIDS-1M, a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. The pipeline first evaluates video quality across multiple dimensions, followed by a fine-grained stage that leverages vision-language models to enhance text-video alignment and semantic richness. Building upon the curated dataset's emphasis on visual quality and temporal coherence, we develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms. The model is trained through a progressive four-stage strategy designed to efficiently handle the complexities of video generation. Extensive experiments demonstrate that our integrated approach of high-quality data curation and efficient training strategy generates visually appealing and temporally coherent videos while maintaining computational efficiency. We will release our dataset, code, and models.
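
As a rough illustration of why decoupled spatial-temporal attention helps computational efficiency, the sketch below counts query-key pairs for joint attention over all video tokens versus spatial attention within each frame followed by temporal attention across frames at each spatial location. The token grid (32 frames of 16 × 16 tokens) is an illustrative assumption, not RACCOON's actual configuration, and both function names are hypothetical.

```python
def joint_attention_pairs(frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    total_tokens = frames * tokens_per_frame
    return total_tokens * total_tokens


def decoupled_attention_pairs(frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs for spatial attention followed by temporal attention."""
    # Spatial: each frame attends only within its own tokens.
    spatial = frames * tokens_per_frame ** 2
    # Temporal: each spatial location attends only across frames.
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal


# Assumed sizes for illustration only.
frames, tokens_per_frame = 32, 16 * 16
joint = joint_attention_pairs(frames, tokens_per_frame)
decoupled = decoupled_attention_pairs(frames, tokens_per_frame)
print(joint, decoupled, round(joint / decoupled, 2))
```

Under these assumed sizes, the decoupled variant evaluates roughly 28 times fewer query-key pairs per layer than joint attention; the real saving depends on the model's actual token counts.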

Paper Structure

This paper contains 26 sections, 8 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Overview of data curation. First, a scene-splitting algorithm divides long, multi-scene videos into single-scene shots. We then filter and sample videos based on five aspects: video quality, OCR, temporal consistency, category, and motion. Finally, we use a Large Language Model (LLM) to curate video-text pairs with erroneous captions.
  • Figure 2: Comparison of statistics between uncurated and curated datasets along four dimensions: (a) aesthetics, (b) motion, (c) Optical Character Recognition (OCR), and (d) temporal consistency.
  • Figure 3: The distribution of categories in Raccoon. The dataset contains a total of 14 categories, with a balanced distribution across the primary categories.
  • Figure 4: Four-stage training pipeline. Stage 1 leverages pre-trained text-to-image models to establish semantic understanding as the foundation for video generation. Stage 2 jointly trains on image and video data at low resolution to efficiently optimize the temporal modules. Stage 3 enhances spatial details and temporal coherence through high-resolution training, enabling long video generation. Stage 4 fine-tunes the model on a curated high-quality dataset to improve the visual consistency and aesthetic quality of generated videos.
  • Figure 5: Qualitative Results. Example videos generated by our method at a resolution of 512 × 512 pixels, with a duration of 4 seconds at 8 frames per second. Our model is capable of generating temporally consistent, photorealistic videos that align with the provided prompts.
  • ...and 4 more figures
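
The coarse filtering and sampling step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `ClipScores` fields, every threshold value, and the per-category cap are hypothetical stand-ins for the five aspects named in the caption (video quality, OCR, temporal consistency, category, and motion).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ClipScores:
    clip_id: str
    category: str
    quality: float               # higher is better (assumed 0..1 scale)
    ocr_area_ratio: float        # fraction of the frame covered by text
    temporal_consistency: float  # higher means fewer abrupt changes
    motion: float                # overall motion magnitude (assumed 0..1)


# Hypothetical thresholds, chosen only for this example.
MIN_QUALITY = 0.5
MAX_OCR_AREA_RATIO = 0.05
MIN_TEMPORAL_CONSISTENCY = 0.9
MOTION_RANGE = (0.1, 0.8)  # reject near-static and chaotic clips


def passes_coarse_filter(clip: ClipScores) -> bool:
    """Apply the quality, OCR, temporal-consistency, and motion checks."""
    return (
        clip.quality >= MIN_QUALITY
        and clip.ocr_area_ratio <= MAX_OCR_AREA_RATIO
        and clip.temporal_consistency >= MIN_TEMPORAL_CONSISTENCY
        and MOTION_RANGE[0] <= clip.motion <= MOTION_RANGE[1]
    )


def curate(clips: list[ClipScores], max_per_category: int) -> list[ClipScores]:
    """Filter clips, then cap each category to keep the sample balanced."""
    kept: list[ClipScores] = []
    counts: dict[str, int] = defaultdict(int)
    for clip in clips:
        if not passes_coarse_filter(clip):
            continue
        if counts[clip.category] >= max_per_category:
            continue
        counts[clip.category] += 1
        kept.append(clip)
    return kept


clips = [
    ClipScores("a", "animal", 0.8, 0.01, 0.95, 0.4),  # passes
    ClipScores("b", "animal", 0.2, 0.01, 0.95, 0.4),  # low quality
    ClipScores("c", "animal", 0.8, 0.30, 0.95, 0.4),  # too much on-screen text
    ClipScores("d", "animal", 0.8, 0.01, 0.95, 0.4),  # passes
    ClipScores("e", "animal", 0.8, 0.01, 0.95, 0.4),  # passes, but over the cap
    ClipScores("f", "food", 0.8, 0.01, 0.95, 0.0),    # static clip
]
kept_ids = [c.clip_id for c in curate(clips, max_per_category=2)]
print(kept_ids)  # ['a', 'd']
```

The fine-grained stage (caption correction with a language model) is left out here because the caption gives no detail on how it is performed.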