HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation

Kun Liu, Qi Liu, Xinchen Liu, Jie Li, Yongdong Zhang, Jiebo Luo, Xiaodong He, Wu Liu

TL;DR

HOIGen-1M tackles the lack of HOI-aligned video data by introducing a large-scale dataset of over 1M HOI videos with expressive, machine-verified captions. The authors implement an end-to-end pipeline that automates video curation with multimodal models and human checks, and they develop a Mixture-of-Multimodal-Experts (MoME) captioning framework to reduce hallucinations. They also propose HOI-specific evaluation metrics, CoarseHOIScore and FineHOIScore, to assess interactive content quality. Experimental results show that current T2V models struggle with HOI generation, but fine-tuning on HOIGen-1M significantly improves HOI rendering, validating the dataset’s utility for advancing HOI video generation and benchmarking.

Abstract

Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes from text. However, current T2V models often cannot generate human-object interaction (HOI) precisely, owing to the lack of large-scale videos with accurate HOI captions. To address this issue, we introduce HOIGen-1M, the first large-scale dataset for HOI generation, consisting of over one million high-quality videos collected from diverse sources. In particular, to guarantee the high quality of the videos, we first design an efficient framework that automatically curates HOI videos using powerful multimodal large language models (MLLMs); the videos are then further cleaned by human annotators. Moreover, to obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy that not only generates expressive captions but also eliminates the hallucinations of individual MLLMs. Furthermore, because no evaluation framework exists for generated HOI videos, we propose two new metrics that assess the quality of generated videos in a coarse-to-fine manner. Extensive experiments reveal that current T2V models struggle to generate high-quality HOI videos and confirm that our HOIGen-1M dataset is instrumental for improving HOI video generation. The project webpage is available at https://liuqi-creat.github.io/HOIGen.github.io.

Paper Structure

This paper contains 12 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of HOIGen-1M. HOIGen-1M contains over one million video clips for HOI video generation with multiple types of HOI videos, diverse scenarios (15,000+ objects and 7,000+ interaction types), and expressive captions.
  • Figure 2: Video frames uniformly sampled from videos generated by several current T2V models. Most videos do not adequately follow the prompt of loading a suitcase onto a bus, showing that current T2V models struggle to generate videos that align with the HOI described in the prompt. A dashed box means the content does not match the prompt, and a solid box indicates that it does.
  • Figure 3: Statistics of video clips in HOIGen-1M. The dataset includes multiple types of HOI and spans a range of clip durations. All videos have a resolution of at least 720p and include significant motions.
  • Figure 4: An illustration of the Mixture-of-Multimodal-Experts (MoME) strategy-based caption method.
  • Figure 5: Caption word statistics in HOIGen-1M. The distribution of word counts shows that the captions are high-quality and fine-grained, with an average length of 152 words. The distribution of actions and objects in the captions further demonstrates the diversity of the dataset. There are over 15,000 objects and over 7,000 interaction action types, making it possible to train a T2V model to simulate the real world. For clarity, only the categories with the highest frequency are listed.
  • ...and 1 more figure
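
The Figure 3 caption states that all clips have a resolution of at least 720p. As a rough illustration of what such a constraint could look like in a curation step, the sketch below filters clip metadata by resolution. It is not the authors' pipeline: the `ClipMeta` record, its field names, and the reading of "720p" as "shorter side of at least 720 pixels" are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipMeta:
    """Hypothetical per-clip metadata record; not the dataset's actual schema."""
    clip_id: str
    width: int
    height: int
    duration_s: float


def meets_min_resolution(clip: ClipMeta, min_short_side: int = 720) -> bool:
    # Assumption: "720p" is read as a shorter side of at least 720 pixels,
    # so that portrait clips (e.g. 1080x1920) are treated like landscape ones.
    return min(clip.width, clip.height) >= min_short_side


def filter_clips(clips: list[ClipMeta], min_short_side: int = 720) -> list[ClipMeta]:
    # Keep only the clips that satisfy the resolution constraint.
    return [c for c in clips if meets_min_resolution(c, min_short_side)]


# Made-up example records, used only to exercise the filter.
clips = [
    ClipMeta("a", 1280, 720, 4.0),
    ClipMeta("b", 640, 360, 6.5),
    ClipMeta("c", 1080, 1920, 3.2),
]
kept = filter_clips(clips)
print([c.clip_id for c in kept])
```

The caption also mentions that clips include significant motion; no threshold or motion measure is given here, so the sketch deliberately leaves that criterion out.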