It Takes Two: Real-time Co-Speech Two-person's Interaction Generation via Reactive Auto-regressive Diffusion Model

Mingyi Shi, Dafei Qin, Leo Ho, Zhouyingcheng Liao, Yinghao Huang, Junichi Yamagishi, Taku Komura

TL;DR

The paper addresses real-time two-person co-speech motion generation by introducing an audio-driven, autoregressive diffusion framework conditioned on both speakers’ speech, past motions, and a trajectory, enabling interactive full-body motion generation. It introduces dual-stream generation with separated conditional tokens and a trajectory predictor to achieve controllable, context-aware interactions, and develops training strategies such as random masking and alternant root normalization to improve robustness. A new extensive InterAct++ dataset enriches two-person dynamic interactions beyond prior datasets. Empirical results across co-speech and interactive generation tasks show state-of-the-art motion quality, interaction realism, and real-time performance, demonstrating significant potential for VR, games, and virtual agents.

Abstract

Conversational scenarios are very common in real-world settings, yet existing co-speech motion synthesis approaches often fall short in these contexts, where one person's audio and gestures influence the other's responses. Additionally, most existing methods rely on offline sequence-to-sequence frameworks, which are unsuitable for online applications. In this work, we introduce an audio-driven, auto-regressive system designed to synthesize dynamic movements for two characters during a conversation. At the core of our approach is a diffusion-based full-body motion synthesis model, which is conditioned on the past states of both characters, speech audio, and a task-oriented motion trajectory input, allowing for flexible spatial control. To enhance the model's ability to learn diverse interactions, we have enriched existing two-person conversational motion datasets with more dynamic and interactive motions. We evaluate our system through multiple experiments to show that it outperforms existing methods across a variety of tasks, including single- and two-person co-speech motion generation, as well as interactive motion generation. To the best of our knowledge, this is the first system capable of generating interactive full-body motions for two characters from speech in an online manner.

Paper Structure

This paper contains 49 sections, 6 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Our system addresses a novel task: taking the speech of two persons as input and generating dynamic full-body interactions in real time. To achieve this, we designed an audio-driven, auto-regressive diffusion model that generates two-person motion, guided by a motion trajectory to improve controllability. To enrich the diversity of these interactions, we captured a new dataset that includes a wide range of daily conversational scenarios, and short-order execution.
  • Figure 2: Concept diagram. Our system takes two persons' speech as input to generate full-body motion. We employ a Large-Speech-Model (LSM) to extract semantic tokens, which are then fed into our autoregressive motion generation module to produce interactive motion under the guidance of a predicted trajectory.
  • Figure 3: The overview of our autoregressive motion generator. Through a dual-stream design, the motions of the two persons are generated simultaneously. At each prediction step, the generative diffusion model receives a separated token as a condition to predict plausible future motion, and selected frames from the predicted motion then serve as conditions for the next generation step. Unlike other sequential generation methods, which often struggle to adapt quickly to changes in another person's motion, our autoregressive approach reacts to the partner's motion effectively, ensuring a more realistic interaction.
  • Figure 4: Our generator delivers realistic interaction between two people in sync with the speech.
  • Figure 5: Qualitative comparison among various co-speech methods. We use a consistent SMPL-X representation for the mesh rendering, except for Audio2Photoreal [li2021audio2gestures], which provides its own mesh template. Single-person co-speech methods, such as Audio2Photoreal, EMAGE [emage2024], and AMUSE [Chhatre_2024_CVPR], fall short in capturing interaction and producing dynamic motion. While the trajectory-based method LDA [alexanderson2023listen] succeeds in creating dynamic motion, its lack of a reactive generation mechanism hampers realism.
  • ...and 5 more figures
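
The dual-stream autoregressive loop described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `toy_sampler` is a hypothetical stand-in for the conditional diffusion sampler, and the constants `WINDOW`, `KEEP`, and `DIM`, the scalar audio features, and the per-step trajectory targets are all assumptions made for the example. It shows only the control flow: at each step both characters are generated from the same snapshot of past states, and a few selected frames of each prediction are fed back as conditions for the next step.

```python
import random
from typing import List, Tuple

Frame = List[float]  # one pose frame as a flat feature vector (toy representation)

WINDOW = 4  # frames predicted per step (hypothetical value)
KEEP = 2    # predicted frames kept and fed back as context (hypothetical value)
DIM = 3     # toy pose dimension (hypothetical value)


def toy_sampler(own_ctx: List[Frame], partner_ctx: List[Frame],
                audio: float, target: Frame, rng: random.Random) -> List[Frame]:
    """Hypothetical placeholder for a conditional diffusion sampler.

    Returns WINDOW future frames that drift toward the trajectory target,
    nudged by the audio feature and by the partner's latest frame.
    """
    last = own_ctx[-1]
    partner_last = partner_ctx[-1]
    frames = []
    for _ in range(WINDOW):
        last = [
            x + 0.5 * (t - x) + 0.1 * audio + 0.05 * (p - x) + rng.gauss(0.0, 0.01)
            for x, t, p in zip(last, target, partner_last)
        ]
        frames.append(last)
    return frames


def rollout(audio_a: List[float], audio_b: List[float],
            traj_a: List[Frame], traj_b: List[Frame],
            start_a: Frame, start_b: Frame,
            seed: int = 0) -> Tuple[List[Frame], List[Frame]]:
    """Generate motion for two characters step by step (online)."""
    rng = random.Random(seed)
    motion_a, motion_b = [start_a], [start_b]
    steps = min(len(audio_a), len(audio_b), len(traj_a), len(traj_b))
    for step in range(steps):
        # Both streams read the same snapshot of past states, so each one
        # can react to the partner's most recent motion.
        ctx_a, ctx_b = motion_a[-KEEP:], motion_b[-KEEP:]
        pred_a = toy_sampler(ctx_a, ctx_b, audio_a[step], traj_a[step], rng)
        pred_b = toy_sampler(ctx_b, ctx_a, audio_b[step], traj_b[step], rng)
        # Only the selected leading frames become context for the next step.
        motion_a.extend(pred_a[:KEEP])
        motion_b.extend(pred_b[:KEEP])
    return motion_a, motion_b


# Usage: three generation steps with constant toy audio and trajectory targets.
motion_a, motion_b = rollout(
    [0.1] * 3, [0.2] * 3,
    [[1.0] * DIM] * 3, [[-1.0] * DIM] * 3,
    [0.0] * DIM, [0.0] * DIM,
)
print(len(motion_a), len(motion_b))
```

Because only `KEEP` frames of each predicted window are committed before the next step, a change in the partner's motion can affect the very next prediction, which is the reactive behaviour the caption contrasts with sequence-level generation.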