
DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, Kris Kitani

TL;DR

DyaDiT is a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. It not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation.

Abstract

Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaged in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio and optional social-context tokens as input and produces context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally use the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.

Paper Structure

This paper contains 25 sections, 4 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: DyaDiT generates socially aware conversational gestures from dyadic audio, conditioned on social factors such as relationship and personality traits, achieving natural and contextually appropriate reactions that outperform prior methods in both quantitative and user evaluations.
  • Figure 2: Overview of DyaDiT. DyaDiT conditions on multiple input modalities, including audio, partner motion, relationship type, and personality scores. It employs an Audio Orthogonalization Cross Attention (ORCA) module to obtain cleaner audio representations and a motion dictionary to guide style-aware gesture generation.
  • Figure 3: ORCA reduces ambiguity between the two audio streams, allowing DyaDiT to generate realistic motion even when one person interrupts the other during the conversation. The example demonstrates that the generated motion adjusts naturally as the conversation shifts.
  • Figure 4: Qualitative Results. Comparison of visualization results between DyaDiT, ConvoFusion [convofusion2024], and Audio2PhotoReal [Ao2023GestureDiffuCLIP]. The gestures generated by DyaDiT exhibit higher diversity and greater realism than those of the other methods.
  • Figure 5: Visualization results under different personality score conditionings. All samples are generated using classifier-free guidance (CFG) with a guidance scale of 2.5.
  • ...and 4 more figures
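
The Figure 5 caption mentions classifier-free guidance with a scale of 2.5. As a rough illustration of what that scale controls, the sketch below shows the commonly used guidance rule, in which the denoiser's unconditional prediction is extrapolated toward its conditional prediction. This is a minimal sketch under stated assumptions: the exact formulation used by DyaDiT is not given in this section, and the function name `cfg_combine`, the array shapes, and the random stand-in predictions are hypothetical.

```python
import numpy as np


def cfg_combine(pred_uncond, pred_cond, scale=2.5):
    """Generic classifier-free guidance rule (assumed form, not taken from the paper).

    scale = 0 returns the unconditional prediction,
    scale = 1 returns the conditional prediction,
    scale > 1 pushes the result further toward the conditioning signal.
    """
    return pred_uncond + scale * (pred_cond - pred_uncond)


# Hypothetical stand-ins for one denoising step's outputs:
# (frames, motion feature dimension) chosen arbitrarily for illustration.
rng = np.random.default_rng(0)
pred_uncond = rng.standard_normal((4, 8))  # prediction with conditions dropped
pred_cond = rng.standard_normal((4, 8))    # prediction with personality/relationship conditions

guided = cfg_combine(pred_uncond, pred_cond, scale=2.5)
```

With a scale of 2.5, the guided prediction lies beyond the conditional one along the direction from unconditional to conditional, which is why larger scales generally make the conditioning (here, personality scores) more visible in the output at some cost to diversity.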