What Are You Doing? A Closer Look at Controllable Human Video Generation

Emanuele Bugliarello, Anurag Arnab, Roni Paiss, Pieter-Jan Kindermans, Cordelia Schmid

TL;DR

The What Are You Doing? (WYD) paper presents a richly annotated benchmark for evaluating controllable human video generation, addressing the lack of fine-grained, diverse datasets in prior work. WYD comprises 1,544 captioned videos annotated with 56 fine-grained sub-categories across 9 aspects, plus manual segmentation masks, enabling both video-level and human-level evaluation. It introduces validated automatic metrics, including pAPE and pICD, and demonstrates that WYD is more challenging than existing benchmarks, exposing novel failure modes across seven state-of-the-art (SOTA) models. The work provides a standardized evaluation protocol and open data and code to accelerate progress in realistic, controllable human video synthesis, while acknowledging ethical and efficiency considerations for future research.

Abstract

High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human generation. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok and TED-Talks, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing 'What Are You Doing?' (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1,544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions and motion. We also propose and validate automatic metrics that leverage our annotations and better capture human evaluations. Equipped with our dataset and metrics, we perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation, showing how WYD provides novel insights about the capabilities of these models. We release our data and code to drive forward progress in human video generation modeling at https://github.com/google-deepmind/wyd-benchmark.

Paper Structure

This paper contains 67 sections, 45 figures, and 5 tables.

Figures (45)

  • Figure 1: Samples from our WYD dataset (above) and from the commonly used datasets for controllable human video generation (below). WYD contains significantly more diverse videos in terms of number of actors, actions, interactions, and scenes, as well as camera motion.
  • Figure 2: Diversity of WYD. WYD contains manually labeled fine-grained annotations for 9 categories and 56 sub-categories relevant to human video generation. Manually annotating TikTok and TED-Talks, we observe their distinct lack of diversity across all nine categories.
  • Figure 3: Overall performance (left: video-level, right: human-level) of SOTA controllable image-to-video models on WYD$_{16}$. Pose models are shown in pink, depth ones in blue, and edge ones in orange. Human generation is multifaceted and no model prevails across all metrics.
  • Figure 4: Performance comparison between WYD, TikTok and TED-Talks for pose-conditioned models. WYD yields larger errors, confirming that its greater diversity is more challenging for models.
  • Figure 5: Example of the SOTA depth-conditioned TF-T2V model failing to preserve people's identities.
  • ...and 40 more figures