Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for Communication

Mingze Sun, Chao Xu, Xinyu Jiang, Yang Liu, Baigui Sun, Ruqi Huang

TL;DR

This work tackles the challenge of generating holistic 3D dyadic motion for both speakers and listeners in real conversations. It introduces HoCo, a large in-the-wild dataset with audio, transcripts, SMPL-X pseudo-GT, and listener-emotion annotations, and proposes a factorized audio representation combined with textual semantics to drive motion. Separate VQ-VAEs for the speaker and listener roles, together with a chain-like transformer autoregressor, enable simultaneous speaker and listener generation while modeling their mutual influence, achieving state-of-the-art results on two benchmark datasets. The approach is relevant to VR, AI agents, and human-robot interaction, where coordinated, naturalistic nonverbal communication matters. Code and dataset release are planned after acceptance.

Abstract

In this paper, we introduce a new task focused on human communication: generating holistic 3D human motions for both speakers and listeners. Central to our approach are a factorization that decouples audio features and the incorporation of textual semantic information, which together facilitate more realistic and coordinated movements. We train separate VQ-VAEs on the holistic motions of the speaker and the listener. To account for the real-time mutual influence between the two, we propose a chain-like transformer-based autoregressive model, designed to characterize real-world communication scenarios, that generates the motions of both the speaker and the listener simultaneously. These designs aim to make the generated results both coordinated and diverse. Our approach demonstrates state-of-the-art performance on two benchmark datasets. Furthermore, we introduce HoCo, a holistic communication dataset intended as a resource for future research. The HoCo dataset and code will be released for research purposes upon acceptance.
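
The abstract does not spell out how the chain-like autoregressive model orders its predictions. As a minimal sketch of one plausible reading, the loop below interleaves the two roles at every time step: the speaker token is predicted first, and the listener token is then predicted with the new speaker token already visible. The function `chain_decode`, the predictor signature, and the toy predictors are all hypothetical stand-ins for illustration; they are not the authors' transformer.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical predictor signature (an assumption, not from the paper):
# (audio_frame, speaker_history, listener_history) -> motion token id
Predictor = Callable[[int, Sequence[int], Sequence[int]], int]


def chain_decode(
    audio: Sequence[int],
    predict_speaker: Predictor,
    predict_listener: Predictor,
) -> Tuple[List[int], List[int]]:
    """Interleaved decoding: one speaker token, then one listener token, per step."""
    speaker_tokens: List[int] = []
    listener_tokens: List[int] = []
    for frame in audio:
        # The speaker sees the audio frame and both histories so far.
        s = predict_speaker(frame, speaker_tokens, listener_tokens)
        speaker_tokens.append(s)
        # The listener additionally sees the speaker token just produced.
        l = predict_listener(frame, speaker_tokens, listener_tokens)
        listener_tokens.append(l)
    return speaker_tokens, listener_tokens


CODEBOOK_SIZE = 8  # toy value, chosen only for this example


def toy_speaker(frame: int, s_hist: Sequence[int], l_hist: Sequence[int]) -> int:
    # Stand-in for a learned model: mixes audio with the listener's last token.
    prev_listener = l_hist[-1] if l_hist else 0
    return (frame + prev_listener) % CODEBOOK_SIZE


def toy_listener(frame: int, s_hist: Sequence[int], l_hist: Sequence[int]) -> int:
    # Stand-in for a learned model: reacts to the speaker's newest token.
    return (s_hist[-1] + 1) % CODEBOOK_SIZE


speaker, listener = chain_decode([1, 5, 9], toy_speaker, toy_listener)
print(speaker, listener)
```

In this sketch the two token sequences grow in lockstep, so each role's next token can depend on the other's most recent output, which is one way to express mutual influence in an autoregressive loop.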

Paper Structure (22 sections, 7 equations, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Illustration of two related tasks and our proposed holistic communicating body generation. Top row: generation of the speaker's head [zhang2023sadtalker], gestures [zhi2023livelyspeaker], or holistic body [yi2022generating] from speech. Bottom left: responsive listening head generation, which synthesizes videos in response to the speaker's video stream [zhou2022responsive], and responsive listening pose generation based on the predicted pose history [ahuja2019react]. Bottom right: our holistic communicating body generation from speech, which generates 3D holistic motions for both speakers and listeners simultaneously.
  • Figure 2: Holistic communicating body generation example. Given a conversation between two participants, our method can generate coordinated and diverse communicative motion. In (a), the speaker is on the left and the listener is on the right. The listener laughs in response to the speaker's joke, accompanied by changes in body posture. In (b), the roles are switched: the speaker is on the right and the listener is on the left. As the speaker narrates an unusual event, the listener expresses surprise with raised gestures and facial expressions in sync with the speaker.
  • Figure 3: Overview of the proposed framework for holistic communicating body generation: (a) audio feature extraction; (b) VQ-VAE models for the generation of speaker and listener motion; (c) transformer-based autoregressive model that simultaneously generates the motion of both speaker and listener in a chain-like manner.
  • Figure 4: In HoCo, we provide high-definition videos of two-person communication (top), as well as the corresponding SMPL-X pseudo ground truth (p-GT) (bottom).
  • Figure 5: (a) shows a piece of transcript and the corresponding audio signals with varying pitches and emotions. (b) displays the inference results from TalkSHOW [yi2022generating], where the generated motions have low sensitivity to changes in the audio signals. (c) presents our inference results, demonstrating high consistency between the generated motions and the audio.
  • ...and 2 more figures
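
Figure 3(b) refers to VQ-VAE models for speaker and listener motion. The step common to any VQ-VAE is replacing each continuous latent vector with its nearest codebook entry, so that motion becomes a sequence of discrete token ids. The sketch below shows only that lookup in plain Python; the codebook values, latent values, and the function names `quantize` and `dequantize` are illustrative assumptions, and the encoder, decoder, and training losses are omitted.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vectors for a sequence of token ids."""
    return [list(codebook[i]) for i in indices]


# Toy 2-D codebook and encoder outputs (made-up numbers for illustration).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices = quantize(latents, codebook)
reconstructed = dequantize(indices, codebook)
print(indices)
```

Under the separate-VQ-VAE design described in the abstract, each role would have its own codebook, so the same lookup would be run once with a speaker codebook and once with a listener codebook.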