DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation

Runze Zhang, Guoguang Du, Xiaochuan Li, Qi Jia, Liang Jin, Lu Liu, Jingjing Wang, Cong Xu, Zhenhua Guo, Yaqian Zhao, Xiaoli Gong, Rengang Li, Baoyu Fan

TL;DR

This work introduces integral spatio-temporal consistency for video generation, addressing how camera movements interact with plot progression to affect previously generated content. It presents DropletVideo-10M, the largest open-source dataset of videos with dynamic camera motion and richly detailed captions, and DropletVideo, a diffusion-based model that preserves both temporal and spatial coherence while enabling controllable motion via a Motion Adaptive Generation mechanism. The approach combines a 3D causal VAE with a 3D modality-expert transformer and specialized training strategies, achieving strong 3D and integral spatio-temporal performance and competitive results against state-of-the-art image-to-video models. The work provides open-source data, models, and evaluation techniques, aiming to spur research into complex multi-plot narratives and camera-driven scene evolution in video generation with real-world applicability across media creation and AI-assisted filmmaking.

Abstract

Spatio-temporal consistency is a critical research topic in video generation. A well-formed generated video segment must ensure plot plausibility and coherence while maintaining visual consistency of objects and scenes across varying viewpoints. Prior research, especially in open-source projects, primarily focuses on either temporal or spatial consistency, or on a basic combination of the two, such as appending a description of a camera movement to a prompt without constraining the outcomes of that movement. However, camera movement may introduce new objects to the scene or eliminate existing ones, thereby overlaying and affecting the preceding narrative. In videos with numerous camera movements especially, the interplay between multiple plots becomes increasingly complex. This paper introduces and examines integral spatio-temporal consistency, which considers the synergy between plot progression and camera techniques as well as the long-term impact of prior content on subsequent generation. Our research spans dataset construction through model development. First, we constructed the DropletVideo-10M dataset, which comprises 10 million videos featuring dynamic camera motion and object actions. Each video is annotated with a caption averaging 206 words that details various camera movements and plot developments. We then developed and trained the DropletVideo model, which excels at preserving spatio-temporal coherence during video generation. The DropletVideo dataset and model are accessible at https://dropletx.github.io.

Paper Structure

This paper contains 28 sections, 1 equation, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Comparison between Composable Spatio-Temporal Consistency and Integral Spatio-Temporal Consistency. (a) Composable Spatio-Temporal Consistency refers to the straightforward combination of temporal and spatial consistency, without constraining the effects of camera movement. Studies such as MovieGen [polyak2024moviegencastmedia] and VBench++ [huang2024vbench++] are dedicated to realizing this consistency. Although a new scene may emerge after a camera movement, the introduced scene tends to be stationary, precluding the onset of further motion. (b) Integral Spatio-Temporal Consistency considers the interplay between plot development and camera techniques, along with the enduring influence of antecedent content on subsequent generation, because a camera movement may introduce or eliminate objects, thereby overlaying and affecting the preceding storyline. For example, in the "Forrest Gump" clip, achieving integral spatio-temporal consistency requires incorporating the motion of the "car" as it recedes following the camera's "turn right" action while maintaining the scene of Forrest running, ensuring that "Forrest Gump's right remains at a consistent distance" and preserving the correct spatial relationships. Temporal consistency in plot progression is highlighted in the blue region, while the red region denotes spatial consistency induced by camera movement.
  • Figure 2: The DropletVideo-10M dataset features diverse camera movements, long-captioned contextual descriptions, and strong spatio-temporal consistency. (a) Existing datasets, such as Panda-70M [chen2024panda-70m], place less emphasis on camera movement and contain relatively brief captions. (b) In contrast, DropletVideo-10M consists of spatio-temporal videos that incorporate both camera movement and event progression. Each video is paired with a caption that conveys detailed spatio-temporal information aligned with the video content, with an average caption length of 206 words. The spatio-temporal information is highlighted in red in the figure.
  • Figure 3: The pipeline we propose for curating the DropletVideo-10M dataset.
  • Figure 4: The aesthetics score distribution and the image quality distribution of DropletVideo-10M. Both distributions are concentrated at high scores, indicating an overall high quality standard for the dataset.
  • Figure 5: Captions generated by the fine-tuned models, including InternVL2-8B [chen2024far, chen2024internvl], ShareGPT4Video-8B [chen2024sharegpt4video], ShareCaptioner-video [chen2024sharegpt4video], and MA-LMM [he2024ma]. InternVL2-8B [chen2024far, chen2024internvl] captures intricate camera work and narrative elements with high efficacy.
  • ...and 13 more figures