OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, Siyu Zhu

TL;DR

OpenHumanVid addresses the scarcity of high-quality, human-centric video data by introducing a large-scale dataset with detailed captions, skeletal motion annotations, and aligned audio. The authors couple this data with an extended diffusion-transformer framework and demonstrate that pretraining on OpenHumanVid improves human-centric video generation while preserving general video capabilities. They highlight the critical role of precise text-to-appearance, motion, and facial-motion alignment for output quality. The work also discusses safety and governance to mitigate privacy concerns and potential misuse.

Abstract

Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models. However, a significant lack of high-quality, human-centric video datasets presents a challenge to progress in this field. To bridge this gap, we introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset characterized by precise and detailed captions that encompass both human appearance and motion states, along with supplementary human motion conditions, including skeleton sequences and speech audio. To validate the efficacy of this dataset and the associated training strategies, we propose an extension of existing classical diffusion transformer architectures and conduct further pretraining of our models on the proposed dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos while preserving performance in general video generation tasks. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs. Based on these insights and corresponding methodologies, the straightforward extended network trained on the proposed dataset demonstrates a clear improvement in the generation of human-centric videos. Project page: https://fudan-generative-vision.github.io/OpenHumanVid

Paper Structure

This paper contains 7 sections and 6 figures.

Figures (6)

  • Figure 1: Videos kept and removed by different quality filters.
  • Figure 2: The illustration of textual captions and corresponding types (long, short, and structured captions).
  • Figure 3: The illustration of human skeleton sequences with respect to given videos.
  • Figure 4: The illustration of speech audio with respect to given videos. Left: video screenshot; right: speech script.
  • Figure 5: Face and body consistency comparison between baseline and ours.
  • ...and 1 more figure