Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for Communication
Mingze Sun, Chao Xu, Xinyu Jiang, Yang Liu, Baigui Sun, Ruqi Huang
TL;DR
This work tackles the generation of holistic 3D dyadic motion for both speakers and listeners in real conversations. It introduces HoCo, a large in-the-wild dataset with audio, transcripts, SMPL-X pseudo-ground-truth (pseudo-GT), and listener-emotion annotations, and proposes a factorized audio representation combined with textual semantics to drive motion. Separate VQ-VAEs for the speaker and the listener, together with a chain-like transformer-based autoregressive model, enable simultaneous speaker and listener generation while modeling their mutual influence, achieving state-of-the-art results on two benchmark datasets. The approach has practical implications for VR, AI agents, and human-robot interaction, where it can enable coordinated, naturalistic nonverbal communication. Code and dataset release is planned after acceptance.
Abstract
In this paper, we introduce a new task focused on human communication: generating holistic 3D human motions for both speakers and listeners. Central to our approach is a factorization that decouples audio features, combined with textual semantic information, which facilitates more realistic and coordinated movements. We train separate VQ-VAEs on the holistic motions of the speaker and the listener. To account for the real-time mutual influence between the two, we propose a novel chain-like transformer-based auto-regressive model, designed to characterize real-world communication scenarios, that generates the motions of the speaker and the listener simultaneously. These designs ensure that the generated results are both coordinated and diverse. Our approach demonstrates state-of-the-art performance on two benchmark datasets. Furthermore, we introduce HoCo, a holistic communication dataset, as a resource for future research. Our HoCo dataset and code will be released for research purposes upon acceptance.
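To make the role-specific VQ-VAE idea concrete, the following is a minimal sketch of the generic vector-quantization step only: each continuous motion latent is replaced by the index of its nearest codebook entry, with one codebook per role. It is not the authors' implementation. The function names, the two-dimensional latents, and the codebook values are all hypothetical and chosen purely for illustration; a real model would learn the codebooks and use far higher-dimensional latents.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each latent to its nearest codebook entry.

    Returns the discrete token indices and the corresponding code vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        # Nearest-neighbour lookup over all codebook entries.
        idx = min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Hypothetical toy codebooks, one per role (assumed values, not learned).
speaker_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
listener_codebook = [[0.0, 1.0], [1.0, 0.0]]

# Hypothetical encoder outputs for a few motion frames of each role.
speaker_latents = [[0.1, -0.1], [1.9, 0.2], [0.9, 1.2]]
listener_latents = [[0.2, 0.9], [0.8, 0.1]]

speaker_tokens, _ = quantize(speaker_latents, speaker_codebook)
listener_tokens, _ = quantize(listener_latents, listener_codebook)

print(speaker_tokens)   # [0, 2, 1]
print(listener_tokens)  # [0, 1]
```

Discrete token sequences of this kind are what an auto-regressive transformer would typically predict; how the speaker and listener tokens are chained together in the proposed model is not specified in the abstract and is not shown here.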
