InterAct: Capture and Modelling of Realistic, Expressive and Interactive Activities between Two Persons in Daily Scenarios
Yinghao Huang, Leo Ho, Dafei Qin, Mingyi Shi, Taku Komura
TL;DR
This paper tackles the challenge of modeling interactive activities between two people in daily scenarios by capturing synchronized audio, body motion, and facial expressions. It introduces a diffusion-based framework, trained on the InterAct dataset, that jointly estimates the motions of two agents from their speech and spatial relations. The dataset comprises 241 scenarios spanning 25 interpersonal relationships and 26 emotions, providing rich multi-modal data for long-duration interactions. Quantitative and qualitative evaluations show state-of-the-art performance in both body motion and facial expression generation. The work advances multi-modal two-person interaction modeling, with potential applications in film, computer graphics, and VR, where realistic, controllable character interactions are required.
Abstract
We address the problem of accurate capture and expressive modelling of interactive behaviors between two persons in daily scenarios. Unlike previous work, which either considers only one person or focuses on conversational gestures, we propose to simultaneously model the activities of two persons, and we target objective-driven, dynamic, and coherent interactions that often span long durations. To this end, we capture a new dataset, dubbed InterAct, composed of 241 motion sequences in each of which two persons perform a realistic scenario over the whole sequence. The audio, body motions, and facial expressions of both persons are all captured in our dataset. We also demonstrate the first diffusion-model-based approach that directly estimates the interactive motions between two persons from their audio alone. All the data and code will be available at: https://hku-cg.github.io/interact.
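The abstract does not describe the model's internals, so the following is only a generic illustration of the forward (noising) step that standard denoising diffusion models share, applied to a toy frame in which the pose features of both persons are concatenated. The linear noise schedule, the feature values, and all function names are assumptions made for illustration; they are not taken from the InterAct code, and audio conditioning and the learned denoiser are omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar[t] = product over s <= t of (1 - beta[s])."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * n
        for x, n in zip(x0, noise)
    ]
    return x_t, noise


# Hypothetical toy frame: pose features of person A and person B, concatenated
# so that both are noised (and, in a full model, denoised) jointly.
person_a = [0.1, -0.3, 0.5]
person_b = [0.2, 0.4, -0.1]
joint_frame = person_a + person_b

alpha_bars = cumulative_alpha_bar(linear_beta_schedule(1000))
rng = random.Random(0)
noisy_frame, noise = q_sample(joint_frame, 500, alpha_bars, rng)
```

In a full system, a network would be trained to predict `noise` (or the clean frame) from `noisy_frame`, the step index, and conditioning signals such as each speaker's audio features. Treating the two persons as one joint sample is one simple way to let their motions be generated together rather than independently.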
