InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions

Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, Lan Xu

TL;DR

InterGen addresses the lack of multi-human interactions in diffusion-based motion generation by introducing the InterHuman dataset and a diffusion model that handles two interacting performers symmetrically. The method uses two cooperative, weight-sharing transformer denoisers connected by a mutual attention mechanism, together with a non-canonical world-frame motion representation and two interaction-specific regularization terms trained under a loss-damping scheme. The reported experiments show improved realism, diversity, and text-motion alignment over baselines, and the model supports applications such as trajectory control and interaction inbetweening. The work provides a data and modeling baseline for text-guided human-to-human interactions, with potential uses in VR/AR, gaming, and cinematic workflows.

Abstract

We have recently seen tremendous progress in diffusion-based methods for generating realistic human motions. Yet these methods largely disregard multi-human interactions. In this paper, we present InterGen, an effective diffusion-based approach that incorporates human-to-human interactions into the motion diffusion process, which enables lay users to customize high-quality two-person interaction motions with only text guidance. We first contribute a multimodal dataset, named InterHuman. It consists of about 107M frames for diverse two-person interactions, with accurate skeletal motions and 23,337 natural language descriptions. On the algorithm side, we carefully tailor the motion diffusion model to our two-person interaction setting. To handle the symmetry of human identities during interactions, we propose two cooperative transformer-based denoisers that explicitly share weights, with a mutual attention mechanism to further connect the two denoising processes. Then, we propose a novel representation for motion input in our interaction diffusion model, which explicitly formulates the global relations between the two performers in the world frame. We further introduce two novel regularization terms to encode spatial relations, equipped with a corresponding damping scheme during the training of our interaction diffusion model. Extensive experiments validate the effectiveness and generalizability of InterGen. Notably, it can generate more diverse and compelling two-person motions than previous methods and enables various downstream applications for human interactions.
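The weight-sharing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal pure-Python toy, assuming a single-head attention layer with a residual connection and no text conditioning, timestep embedding, or feed-forward block. All names (`cross_attention`, `denoise_pair`, `params`) are hypothetical. It only shows why applying one set of weights to both performers, each attending to the other, makes the result independent of which performer is labeled first.

```python
import math
import random


def project(rows, weight):
    """Multiply each row vector (T x D) by a weight matrix (D x D)."""
    return [
        [sum(r[k] * weight[k][j] for k in range(len(weight)))
         for j in range(len(weight[0]))]
        for r in rows
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, other_tokens, params):
    """Single-head attention: one performer's tokens attend to the other's."""
    q = project(query_tokens, params["wq"])
    k = project(other_tokens, params["wk"])
    v = project(other_tokens, params["wv"])
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def denoise_pair(x_a, x_b, params):
    """Toy 'denoiser' step: the SAME params process both performers (assumed design)."""
    attn_a = cross_attention(x_a, x_b, params)  # A attends to B
    attn_b = cross_attention(x_b, x_a, params)  # B attends to A
    out_a = [[x + a for x, a in zip(r, ar)] for r, ar in zip(x_a, attn_a)]
    out_b = [[x + a for x, a in zip(r, ar)] for r, ar in zip(x_b, attn_b)]
    return out_a, out_b


def random_matrix(rng, rows, cols):
    """Small random matrix used for toy weights and toy motion tokens."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, frames = 4, 3  # hypothetical feature size and number of frames
params = {name: random_matrix(rng, dim, dim) for name in ("wq", "wk", "wv")}
x_a = random_matrix(rng, frames, dim)  # noisy motion tokens, performer A
x_b = random_matrix(rng, frames, dim)  # noisy motion tokens, performer B

out_a, out_b = denoise_pair(x_a, x_b, params)
swapped_a, swapped_b = denoise_pair(x_b, x_a, params)

# Swapping the inputs swaps the outputs: no performer has a privileged identity.
print(swapped_a == out_b and swapped_b == out_a)
```

With two separately parameterized networks, the swap check above would generally fail, which is one way to read the motivation for sharing weights.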

Paper Structure

This paper contains 19 sections, 15 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: InterGen is capable of generating high-quality and diverse motions under complex interactions. It models the two-person symmetry with cooperative diffusion denoisers sharing the same motion manifold.
  • Figure 2: Our motion capture studio (top) and our collected InterHuman dataset illustration (bottom). The system comprises 76 calibrated multi-view cameras. InterHuman covers a wide range of two-person interactions.
  • Figure 3: InterHuman dataset consists of diverse human professional and daily interactions with diverse natural language annotations from different annotators. The figure showcases two examples of our dataset, martial arts, and social manners, with thorough descriptions from different perspectives.
  • Figure 4: InterHuman dataset covers a wide range of two-person interactions, from daily ones like hugging, handshakes, and arguments to professional motions ranging from dance to martial arts.
  • Figure 5: The overview of our InterGen. We contribute three primary technical designs. First, we propose an efficient two-person interaction motion representation. Second, we introduce two cooperative transformer-style weights-sharing networks with mutual attention to interactively perform denoising. Lastly, we introduce an effective loss function that significantly improves the quality of two-person interaction generation.
  • ...and 7 more figures