InterAct: Capture and Modelling of Realistic, Expressive and Interactive Activities between Two Persons in Daily Scenarios

Yinghao Huang, Leo Ho, Dafei Qin, Mingyi Shi, Taku Komura

TL;DR

This paper tackles the modeling of interactive activities between two people in daily scenarios by capturing synchronized audio, body motion, and facial expressions. It introduces a diffusion-model-based framework that jointly estimates the motions of both people from their speech and spatial relations, trained on the InterAct dataset. The dataset comprises 241 scenarios spanning 25 interpersonal relationships and 26 emotions, providing rich multi-modal data for long-duration interactions. Quantitative and qualitative evaluations are reported as showing state-of-the-art performance in both body motion and facial expression generation. Potential applications include film, computer graphics, and VR environments that require realistic, controllable character interactions.

Abstract

We address the problem of accurate capture and expressive modelling of interactive behaviors happening between two persons in daily scenarios. Different from previous works which either only consider one person or focus on conversational gestures, we propose to simultaneously model the activities of two persons, and target objective-driven, dynamic, and coherent interactions which often span long duration. To this end, we capture a new dataset dubbed InterAct, which is composed of 241 motion sequences where two persons perform a realistic scenario over the whole sequence. The audios, body motions, and facial expressions of both persons are all captured in our dataset. We also demonstrate the first diffusion model based approach that directly estimates the interactive motions between two persons from their audios alone. All the data and code will be available at: https://hku-cg.github.io/interact.

Paper Structure (21 sections, 8 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: We capture full-body interactions between two persons in daily scenarios. Left: The fact that the man is a 1960s pop star surprises his neighbor. Middle: Boss (female) comforts the employee (male) who receives a sad phone call. Right: Co-workers confess their romantic feelings.
  • Figure 2: Our method's pipeline begins by taking the raw audio signals from two individuals. It then incorporates a suite of conditions, including BERT (Devlin et al., 2018) features, relative orientation and position data, as well as action labels. Utilizing these inputs, our approach employs two distinct diffusion models in tandem to generate lifelike and varied facial and body animations. The result of our algorithm is the simultaneous production of intricate 3D facial meshes and comprehensive global body movements for both subjects.
  • Figure 3: Actors during performance, showing the body and face capture setup
  • Figure 4: Histograms of the second actor's root position on the ground plane (XZ plane) relative to the first actor, for 4 different kinds of relationships. Units are meters. Note how the relative position of the second actor with respect to the first actor varies depending on the relationship of the actors.
  • Figure 5: The heat map of face animation variance between different facing directions (Left), emotions (Right Top), and relationships (Right Bottom).
  • ...and 1 more figure
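
The statistic described in the Figure 4 caption (the second actor's root position on the XZ ground plane relative to the first actor) can be illustrated with a minimal sketch. This is not the authors' code. It assumes a Y-up world, root positions in meters, and a per-frame heading angle for the first actor where yaw = 0 means facing world +Z. The function names `relative_xz` and `position_histogram` are hypothetical, and whether the paper also rotates the offset into the first actor's facing frame is an assumption made here.

```python
import numpy as np

def relative_xz(root_a, yaw_a, root_b):
    """Express actor B's root in actor A's ground-plane frame.

    root_a, root_b: (T, 3) world positions in meters, Y up (assumed).
    yaw_a: (T,) heading of A about the Y axis in radians (assumed);
           yaw = 0 means A faces world +Z.
    Returns (T, 2): columns are (lateral offset, forward offset).
    """
    # World-space offset of B from A on the ground plane (X and Z only).
    d = root_b[:, [0, 2]] - root_a[:, [0, 2]]
    c, s = np.cos(yaw_a), np.sin(yaw_a)
    # Project onto A's lateral axis (cos, -sin) and forward axis (sin, cos).
    lateral = c * d[:, 0] - s * d[:, 1]
    forward = s * d[:, 0] + c * d[:, 1]
    return np.stack([lateral, forward], axis=1)

def position_histogram(rel, bins=40, extent=2.0):
    """2D histogram of relative positions within +/- extent meters."""
    hist, x_edges, z_edges = np.histogram2d(
        rel[:, 0], rel[:, 1],
        bins=bins,
        range=[[-extent, extent], [-extent, extent]],
    )
    return hist, x_edges, z_edges

# Toy data: A stands at the origin facing world +X; B is 1 m along +X,
# so B should land directly in front of A in A's local frame.
root_a = np.zeros((4, 3))
yaw_a = np.full(4, np.pi / 2)
root_b = np.tile(np.array([1.0, 0.0, 0.0]), (4, 1))
rel = relative_xz(root_a, yaw_a, root_b)
hist, _, _ = position_histogram(rel, bins=4, extent=2.0)
```

Accumulating one such histogram per relationship label would give a per-relationship comparison of the kind the caption describes. The bin count and extent are illustrative choices.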