It Takes Two: Real-time Co-Speech Two-person's Interaction Generation via Reactive Auto-regressive Diffusion Model
Mingyi Shi, Dafei Qin, Leo Ho, Zhouyingcheng Liao, Yinghao Huang, Junichi Yamagishi, Taku Komura
TL;DR
The paper addresses real-time two-person co-speech motion generation with an audio-driven, autoregressive diffusion framework conditioned on both speakers’ speech, their past motions, and a trajectory, enabling interactive full-body motion generation. The model uses dual-stream generation with separated conditional tokens and a trajectory predictor to produce controllable, context-aware interactions, and it is trained with strategies such as random masking and alternant root normalization to improve robustness. A new dataset, InterAct++, extends prior two-person conversational motion datasets with more dynamic and interactive motions. Experiments on co-speech and interactive generation tasks report state-of-the-art motion quality and interaction realism at real-time speed, suggesting potential for VR, games, and virtual agents.
Abstract
Conversational scenarios are very common in real-world settings, yet existing co-speech motion synthesis approaches often fall short in these contexts, where one person's audio and gestures influence the other's responses. Additionally, most existing methods rely on offline sequence-to-sequence frameworks, which are unsuitable for online applications. In this work, we introduce an audio-driven, auto-regressive system designed to synthesize dynamic movements for two characters during a conversation. At the core of our approach is a diffusion-based full-body motion synthesis model, which is conditioned on the past states of both characters, speech audio, and a task-oriented motion trajectory input, allowing for flexible spatial control. To enhance the model's ability to learn diverse interactions, we have enriched existing two-person conversational motion datasets with more dynamic and interactive motions. We evaluate our system through multiple experiments, showing that it outperforms existing methods across a variety of tasks, including single and two-person co-speech motion generation, as well as interactive motion generation. To the best of our knowledge, this is the first system capable of generating interactive full-body motions for two characters from speech in an online manner.
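The conditioning scheme described above can be illustrated with a toy rollout loop. This is only a minimal sketch under strong assumptions, not the authors' implementation: each motion frame is reduced to a single float, `toy_denoise` is a hypothetical stand-in for the learned diffusion denoiser, and the choice to give each character only its own trajectory is an assumption made for illustration. All function and parameter names are invented here.

```python
import random


def toy_denoise(noisy, cond, steps=4):
    """Hypothetical stand-in for a learned denoiser.

    Starting from a noisy sample, each step moves halfway toward the
    mean of the conditioning values. A real model would be a trained network.
    """
    target = sum(cond) / len(cond)
    x = noisy
    for _ in range(steps):
        x = x + 0.5 * (target - x)
    return x


def rollout(audio_a, audio_b, traj_a, traj_b, history=2, seed=0):
    """Generate one toy motion value per frame for two characters, online.

    At every frame, both characters are conditioned on the recent past
    motion of BOTH characters and on BOTH audio streams, so each one can
    react to the other. Only the trajectory input differs (an assumption).
    """
    rng = random.Random(seed)
    motion_a = [0.0] * history  # rest-pose history for character A
    motion_b = [0.0] * history  # rest-pose history for character B
    for t in range(len(audio_a)):
        past = motion_a[-history:] + motion_b[-history:]
        cond_a = past + [audio_a[t], audio_b[t], traj_a[t]]
        cond_b = past + [audio_a[t], audio_b[t], traj_b[t]]
        # Each new frame starts from Gaussian noise and is denoised
        # under its condition; the result feeds the next frame's history.
        next_a = toy_denoise(rng.gauss(0, 1), cond_a)
        next_b = toy_denoise(rng.gauss(0, 1), cond_b)
        motion_a.append(next_a)
        motion_b.append(next_b)
    return motion_a[history:], motion_b[history:]


if __name__ == "__main__":
    a, b = rollout([0.1, 0.2, 0.3], [0.0, 0.5, 0.0],
                   [1.0, 1.0, 1.0], [-1.0, -1.0, -1.0])
    print(len(a), len(b))
```

Because each frame depends only on past frames and the current inputs, the loop can run as audio arrives, which is the property that makes an auto-regressive formulation suitable for online use.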
