Dyadic Interaction Modeling for Social Behavior Generation

Minh Tran, Di Chang, Maksim Siniukov, Mohammad Soleymani

TL;DR

Dyadic Interaction Modeling (DIM) addresses naturalistic listener responses by jointly modeling speaker and listener motions through self-supervised masked pre-training on large-scale dyadic data. It learns discrete motion priors $z^{(s)}$ and $z^{(l)}$ via two VQ-VAE encoders, optimizes a contrastive loss $\text{L}_c$, and uses a joint transformer-based decoder to produce continuous motions for both participants (DIM-Listener and DIM-Speaker). Extensive experiments on ViCo, LM_Listener, and BiWi demonstrate state-of-the-art performance in diversity, realism, and synchrony, with photorealistic rendering via PIRenderer. Limitations include renderer identity-specific fine-tuning and reliance on EMOCA representations, suggesting future work on generalized rendering models and richer motion representations. Overall, the approach advances practical human-computer interaction, VR, and synthetic media by enabling context-aware dyadic social behavior generation.

Abstract

Human-human communication is like a delicate dance where listeners and speakers concurrently interact to maintain conversational dynamics. Hence, an effective model for generating listener nonverbal behaviors requires understanding the dyadic context and interaction. In this paper, we present an effective framework for creating 3D facial motions in dyadic interactions. Existing work considers a listener as a reactive agent with reflexive behaviors to the speaker's voice and facial motions. The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach that jointly models speakers' and listeners' motions through masking and contrastive learning to learn representations that capture the dyadic context. To enable the generation of non-deterministic behaviors, we encode both listener and speaker motions into discrete latent representations through VQ-VAE. The pre-trained model is further fine-tuned for motion generation. Extensive experiments demonstrate the superiority of our framework in generating listener motions, establishing a new state of the art according to the quantitative measures capturing the diversity and realism of generated motions. Qualitative results demonstrate the superior capabilities of the proposed approach in generating diverse and realistic expressions, eye blinks, and head gestures. The code is available at https://github.com/Boese0601/Dyadic-Interaction-Modeling
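
To make the "discrete latent representations" in the abstract concrete, the following is a minimal sketch of the nearest-neighbor quantization step used in VQ-VAE-style models in general. It is not taken from the authors' code: the function names `quantize` and `dequantize`, the 2-D toy frames, and the three-entry codebook are all illustrative assumptions, and a real model would learn both the encoder and the codebook.

```python
def quantize(frames, codebook):
    """Map each continuous frame vector to the index of its nearest codebook entry.

    `frames` and `codebook` are lists of equal-length float vectors.
    (Hypothetical helper for illustration; not the paper's API.)
    """
    indices = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook vector.
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Look up the codebook vector for each discrete index."""
    return [list(codebook[i]) for i in indices]


# Toy data (assumed): three 2-D codebook entries and three encoded motion frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.4]]

codes = quantize(frames, codebook)
print(codes)  # [0, 1, 2]
print(dequantize(codes, codebook))
```

Because the sequence is reduced to a list of integer codes, a downstream model can predict a distribution over codes at each step and sample from it, which is one way to obtain the non-deterministic behavior the abstract mentions.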

Paper Structure

This paper contains 21 sections, 10 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: We propose Dyadic Interaction Modeling, a pre-training strategy that jointly models speakers’ and listeners’ motions and learns representations that capture the dyadic context. We then utilize the pre-trained weights and feed multimodal inputs from the speaker into DIM-Listener. DIM-Listener is capable of generating photorealistic videos for the listener's motion.
  • Figure 2: Dyadic Interaction Modeling learns a unified speaker-listener representation from dyadic interactions. 1) The framework takes both the ground-truth speaker motion $s$ and the listener motion $l$ as input. 2) VQ-Encoders of speaker and listener then encode the motions to discrete units (discrete latent codes) $z^{(s)}$ and $z^{(l)}$. 3) The masked speaker's and listener's motions are further encoded and concatenated so that a unified representation is learned with contrastive loss. 4) Then, the split unified representation and speaker audio feature $a$ are decoded into discrete unit predictions $z'^{(s)}$ and $z'^{(l)}$ supervised by cross-entropy loss. 5) Finally, the generated speaker motions $s'$ and listener motions $l'$ are decoded from these discrete unit predictions to optimize the reconstruction loss.
  • Figure 3: For fine-tuning the model on listener motion generation, speaker input is not masked, and the listener input is entirely masked. We train the framework with the same cross-entropy loss and reconstruction loss from Dyadic Interaction Modeling while keeping the weights of listener VQ-Encoder fixed.
  • Figure 4: Comparison with L2L [ng2022learning] (pre-trained on CANDOR [reece2023candor]), ELP [song2023emotional], and RLHG [ECCV2022]. Our method can generate diverse head movements while maintaining facial expressions that better align with speakers' sentiments.
  • Figure 5: DIM-Listener can generate diverse listener emotions (e.g. Happy and Angry) and facial behaviors (e.g. Eyes Blinking and Head Shaking).
  • ...and 1 more figure
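
The captions of Figures 2 and 3 describe two masking regimes (both streams partially masked during pre-training; speaker unmasked and listener fully masked during fine-tuning) and a contrastive loss on the speaker and listener representations. The sketch below illustrates both ideas under stated assumptions: the 50% pre-training mask ratio, the `MASK` placeholder, the helper names, and the InfoNCE form of the contrastive loss are not specified in the text above and are chosen here only for illustration.

```python
import math
import random

MASK = None  # hypothetical placeholder standing in for a masked frame


def mask_sequence(seq, ratio, rng):
    """Replace a fraction `ratio` of frames with MASK (ratio=1.0 masks everything)."""
    n_mask = round(len(seq) * ratio)
    masked_idx = set(rng.sample(range(len(seq)), n_mask))
    return [MASK if i in masked_idx else x for i, x in enumerate(seq)]


def info_nce(speaker_emb, listener_emb, temperature=0.1):
    """InfoNCE over a batch: the i-th speaker embedding should match the
    i-th listener embedding and mismatch all others (assumed loss form)."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    s = [normalize(v) for v in speaker_emb]
    l = [normalize(v) for v in listener_emb]
    loss = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of speaker i with every listener, scaled by temperature.
        logits = [sum(a * b for a, b in zip(s_i, l_j)) / temperature for l_j in l]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_denom - logits[i]
    return loss / len(s)


rng = random.Random(0)
speaker = list(range(8))   # toy frame ids for the speaker stream
listener = list(range(8))  # toy frame ids for the listener stream

# Pre-training: both streams are partially masked (ratio is an assumption).
pre_s = mask_sequence(speaker, 0.5, rng)
pre_l = mask_sequence(listener, 0.5, rng)

# Fine-tuning for listener generation: speaker visible, listener fully masked.
ft_s = mask_sequence(speaker, 0.0, rng)
ft_l = mask_sequence(listener, 1.0, rng)

# The loss is low when paired embeddings align and high when they are swapped.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Expressing fine-tuning as the same masking function with ratios 0.0 and 1.0 mirrors the Figure 3 description, where the pre-training objective is reused and only the masking pattern changes.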