Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Mengyi Shan, Lu Dong, Yutao Han, Yuan Yao, Tao Liu, Ifeoma Nwogu, Guo-Jun Qi, Mitch Hill

TL;DR

This work tackles open-domain text-driven multi-person motion generation by introducing a diffusion-based model with interleaved pose and motion transformer layers and a two-stage sampling process. It jointly trains across multiple data sources, including the newly created LAION-Pose and WebVid-Motion datasets, to produce multi-person motions for an arbitrary number of subjects guided by textual prompts. The two-stage pose-to-motion framework first generates a middle-frame pose conditioned on text and then animates it into a full sequence. The model is optimized via a denoising objective $L$ and represents each person with a $24 \times 3$ SMPL pose vector plus additional shape parameters. The authors report both qualitative and quantitative advantages over baselines, with a decomposed evaluation that separately assesses frame-level pose quality and per-subject motion realism, and they release the datasets to spur future research in open-domain multi-person motion synthesis.

Abstract

This work aims to generate natural and diverse group motions of multiple humans from textual descriptions. While single-person text-to-motion generation is extensively studied, it remains challenging to synthesize motions for more than one or two subjects from in-the-wild prompts, mainly due to the lack of available datasets. In this work, we curate human pose and motion datasets by estimating pose information from large-scale image and video datasets. Our models use a transformer-based diffusion framework that accommodates multiple datasets with any number of subjects or frames. Experiments explore both generation of multi-person static poses and generation of multi-person motion sequences. To our knowledge, our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.

Paper Structure (17 sections, 3 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: We jointly train with multiple data sources including motion capture data and pose/motion extracted from image/video datasets. The model generates motion sequences from text for an arbitrary number of subjects.
  • Figure 2: Dataset visualizations. Top 2 rows: LAION-Pose dataset. Left is the original image from LAION-400M [schuhmann2021laion], right is the BEV [sun2022bev] detection. Bottom 2 rows: WebVid-Motion dataset. Left is the first frame of the original video from WebVid-10M [bain21webvid], right is the motion sequence estimated by TRACE [sun2023trace], visualized from a different camera angle.
  • Figure 3: Our model is a diffusion framework consisting of interleaving pose and motion layers. At each pose layer, we reshape the temporal dimension into the batch dimension so that the layer focuses on per-frame subject interactions; at each motion layer, we reshape the subject dimension into the batch dimension so that the layer focuses on per-subject temporal movements. Each layer is implemented as a transformer encoder. Diffusion time steps and text or pose conditions are encoded and summed into a single condition token that is concatenated to the beginning of the sequence.
  • Figure 4: Qualitative result for text-to-pose generation.
  • Figure 5: Qualitative results for text-to-motion generation.
  • ...and 1 more figures
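
The reshaping described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is only a sketch under stated assumptions, not the authors' implementation: the tensor layout `(B, N, T, D)` (batch, subjects, frames, features), the function names `pose_layer` and `motion_layer`, and the stand-in `mix_over_sequence` (which replaces a real transformer encoder with a simple mean over the sequence) are all hypothetical and chosen only to make the data flow visible.

```python
import numpy as np

def mix_over_sequence(seq):
    """Hypothetical stand-in for a transformer encoder layer.

    seq has shape (batch, seq_len, dim). Every token receives the mean of
    its own sequence, so information flows only within a sequence.
    """
    return seq + seq.mean(axis=1, keepdims=True)

def pose_layer(x, cond):
    """Mix information across subjects within each frame.

    x: (B, N, T, D) array, cond: (B, D) condition token (assumed layout).
    """
    B, N, T, D = x.shape
    # Fold time into the batch dimension: (B, N, T, D) -> (B*T, N, D).
    seq = x.transpose(0, 2, 1, 3).reshape(B * T, N, D)
    # Prepend the condition token to every per-frame sequence.
    tok = np.repeat(cond, T, axis=0)[:, None, :]
    seq = np.concatenate([tok, seq], axis=1)
    # Apply the layer, then drop the condition token again.
    out = mix_over_sequence(seq)[:, 1:, :]
    return out.reshape(B, T, N, D).transpose(0, 2, 1, 3)

def motion_layer(x, cond):
    """Mix information across frames within each subject."""
    B, N, T, D = x.shape
    # Fold subjects into the batch dimension: (B, N, T, D) -> (B*N, T, D).
    seq = x.reshape(B * N, T, D)
    tok = np.repeat(cond, N, axis=0)[:, None, :]
    seq = np.concatenate([tok, seq], axis=1)
    out = mix_over_sequence(seq)[:, 1:, :]
    return out.reshape(B, N, T, D)

def interleaved_block(x, cond):
    """One pose layer followed by one motion layer."""
    return motion_layer(pose_layer(x, cond), cond)
```

Because the folded dimension is treated as batch, a pose layer cannot move information between frames and a motion layer cannot move information between subjects; stacking the two alternately lets both kinds of dependency propagate. The same code runs unchanged for any number of subjects `N` or frames `T`, which is consistent with the caption's claim that the model handles an arbitrary number of subjects.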