LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

Tianwei Xiong, Yuqing Wang, Daquan Zhou, Zhijie Lin, Jiashi Feng, Xihui Liu

TL;DR

The paper identifies a gap in long video generation: a lack of high-quality long-take datasets with dense captions. It introduces LVD-2M, a dataset of 2 million video-caption pairs curated from 220 million videos. A multi-stage filtering pipeline (scene-cut detection, optical flow for motion, and semantic filtering with MLLMs) selects the videos, and a hierarchical captioning system produces temporally dense captions via image-grid VLM captioning followed by LLM-based refinement. The authors validate the dataset with human evaluations and show that fine-tuning diffusion-based and LM-based video generation models on LVD-2M improves long-range dynamics and temporal consistency, including extending diffusion-based T2V models to 65 frames. They present LVD-2M as a practical resource for long video generation research and applications, citing gains in long-take video quality, motion dynamics, and caption richness, while acknowledging considerations of scale and social impact.

Abstract

The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large amount of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally-dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.
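
As a rough illustration of the filtering idea in the abstract, the sketch below combines the three kinds of metrics (scene cuts, dynamic degree, semantic-level quality) into one keep-or-drop decision. It is a minimal sketch and not the authors' implementation: the `VideoStats` fields, the helper names, and the motion and semantic thresholds are all hypothetical. Only the 10-second minimum duration and the no-cuts requirement come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class VideoStats:
    """Precomputed per-video measurements (all field names are hypothetical)."""
    video_id: str
    duration_s: float      # clip length in seconds
    num_scene_cuts: int    # count reported by some scene-cut detector
    mean_flow: float       # average optical-flow magnitude, used as a motion proxy
    semantic_score: float  # quality score from a multimodal LLM, assumed in [0, 1]


MIN_DURATION_S = 10.0  # from the abstract: videos cover at least 10 seconds
MIN_MEAN_FLOW = 2.0    # assumed threshold, not taken from the paper
MIN_SEMANTIC = 0.5     # assumed threshold, not taken from the paper


def keep_video(v: VideoStats) -> bool:
    """Return True only if the video passes every filtering criterion."""
    return (
        v.duration_s >= MIN_DURATION_S
        and v.num_scene_cuts == 0          # long-take: no cuts allowed
        and v.mean_flow >= MIN_MEAN_FLOW   # enough motion
        and v.semantic_score >= MIN_SEMANTIC
    )


def filter_videos(videos: Iterable[VideoStats]) -> List[VideoStats]:
    """Keep the videos that satisfy all criteria, preserving input order."""
    return [v for v in videos if keep_video(v)]
```

In practice the cheap checks (duration, scene cuts) would likely run before the expensive ones (optical flow, MLLM scoring), which is why a staged pipeline is a natural fit for a large source pool.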

Paper Structure

This paper contains 28 sections, 21 figures, 3 tables.

Figures (21)

  • Figure 1: Comparison of our proposed LVD-2M dataset against previous datasets. Our dataset contains long-take videos with significant motion and temporally dense captions (different colors represent captions for different frames), contrasting with short videos and sparse annotations in previous datasets like Panda-70M [chen2024panda70m], HD-VG [hdvg], and WebVid [Bain21webvid] (shown as "Others").
  • Figure 2: Video filtering process. Our video filtering process employs multiple criteria to select high-quality, dynamic, and long-take videos from four source datasets.
  • Figure 3: Hierarchical video captioning process. First, we split the long video into 30-second clips and compose them into image grids. Then, we use the LLaVA-1.6 model [liu2024llavanext] to generate captions for each image grid. Finally, we use the Claude3-Haiku model [Claude3-Haiku] to refine and merge these captions into the final complete caption for the whole video.
  • Figure 4: Statistics of LVD-2M. LVD-2M consists of long video clips with detailed dense captions, and diverse categories.
  • Figure 5: The distribution of human-rated dynamic degree score and human preference for caption quality, comparing LVD-2M with other video datasets.
  • ...and 16 more figures
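
The hierarchical captioning flow described in the Figure 3 caption can be sketched as three steps: split the video into 30-second clips, pick frame timestamps for one image grid per clip, then caption each grid and merge the results. This is a minimal sketch under stated assumptions rather than the paper's code: the function names, the 2x3 grid size, and the evenly spaced sampling are hypothetical, and the two model calls are passed in as plain callables (`caption_grid` standing in for the VLM, `merge` for the LLM refinement step).

```python
from typing import Callable, List, Tuple

CLIP_SECONDS = 30.0  # clip length stated in the Figure 3 caption


def split_into_clips(duration_s: float,
                     clip_s: float = CLIP_SECONDS) -> List[Tuple[float, float]]:
    """Cut [0, duration_s) into consecutive (start, end) windows of at most clip_s."""
    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_s, duration_s)
        clips.append((start, end))
        start = end
    return clips


def grid_timestamps(start: float, end: float,
                    rows: int = 2, cols: int = 3) -> List[float]:
    """Evenly spaced sampling times for one rows x cols grid (grid size is assumed)."""
    n = rows * cols
    step = (end - start) / n
    # Sample at the midpoint of each of the n equal sub-intervals.
    return [start + step * (i + 0.5) for i in range(n)]


def caption_video(duration_s: float,
                  caption_grid: Callable[[List[float]], str],
                  merge: Callable[[List[str]], str]) -> str:
    """Caption each clip's grid, then merge the clip captions into one caption."""
    clip_captions = [
        caption_grid(grid_timestamps(start, end))
        for start, end in split_into_clips(duration_s)
    ]
    return merge(clip_captions)
```

For example, `caption_video(75.0, caption_grid=my_vlm, merge=my_llm)` would call the hypothetical `my_vlm` three times (for the windows 0-30 s, 30-60 s, and 60-75 s) and `my_llm` once on the three resulting captions.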