Social Agent: Mastering Dyadic Nonverbal Behavior Generation via Conversational LLM Agents

Zeyi Zhang, Yanju Zhou, Heyuan Yao, Tenglong Ao, Xiaohang Zhan, Libin Liu

TL;DR

Social Agent is a framework for dyadic nonverbal behavior generation that combines an autoregressive dual-person diffusion model with an LLM-powered agentic system that plans and regulates interpersonal behaviors. Its Scene Designer and Dynamic Controller modules analyze the dialogue context and upcoming turns to produce proxemic, gaze, and gesture guidance, which is translated into motion constraints through an interaction-guided diffusion process. Across datasets, user studies, and quantitative metrics, the approach shows improved human likeness, beat alignment, and interaction level, indicating stronger engagement and naturalism in dyadic conversations. The work bridges high-level social reasoning and low-level motion synthesis, enabling interactive two-person motion generation with potential applications in virtual agents, social robotics, and immersive storytelling.

Abstract

We present Social Agent, a novel framework for synthesizing realistic and contextually appropriate co-speech nonverbal behaviors in dyadic conversations. In this framework, we develop an agentic system driven by a Large Language Model (LLM) to direct the conversation flow and determine appropriate interactive behaviors for both participants. Additionally, we propose a novel dual-person gesture generation model based on an auto-regressive diffusion model, which synthesizes coordinated motions from speech signals. The output of the agentic system is translated into high-level guidance for the gesture generator, resulting in realistic movement at both the behavioral and motion levels. Furthermore, the agentic system periodically examines the movements of interlocutors and infers their intentions, forming a continuous feedback loop that enables dynamic and responsive interactions between the two participants. User studies and quantitative evaluations show that our model significantly improves the quality of dyadic interactions, producing natural, synchronized nonverbal behaviors.
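The feedback loop described in the abstract can be pictured with a toy sketch. This is only an illustration under loud assumptions, not the authors' implementation: `Guidance`, `DyadState`, `plan_round`, `generate_motion`, and `run_dialogue` are hypothetical names, the "agent" is a keyword rule rather than an LLM, and the "generator" returns text labels rather than motion.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Guidance:
    """Hypothetical high-level guidance for one generation round."""
    gaze_at_partner: bool
    nod: bool


@dataclass
class DyadState:
    """Hypothetical record of the guidance and motion produced so far."""
    guidance: List[Guidance] = field(default_factory=list)
    motion: List[str] = field(default_factory=list)


def plan_round(state: DyadState, utterance: str) -> Guidance:
    """Stand-in for the LLM agent: a keyword rule, not a real LLM call."""
    is_question = utterance.strip().endswith("?")
    # Feedback step: inspect the previous round so the listener
    # does not nod in two consecutive rounds.
    nodded_last = bool(state.guidance) and state.guidance[-1].nod
    return Guidance(gaze_at_partner=is_question,
                    nod=not is_question and not nodded_last)


def generate_motion(guidance: Guidance, utterance: str) -> str:
    """Stand-in for the dual-person diffusion model: returns a text label."""
    cues = [name for name, on in (("gaze", guidance.gaze_at_partner),
                                  ("nod", guidance.nod)) if on]
    return f"{utterance!r} -> {'+'.join(cues) or 'neutral'}"


def run_dialogue(utterances: List[str]) -> DyadState:
    """Alternate planning and generation, one round per utterance."""
    state = DyadState()
    for utterance in utterances:
        guidance = plan_round(state, utterance)
        state.guidance.append(guidance)
        state.motion.append(generate_motion(guidance, utterance))
    return state
```

The point of the sketch is the control flow: each round's guidance depends on what was produced in earlier rounds, and the generator only ever sees the guidance, not the planner's internals.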

Paper Structure

This paper contains 44 sections, 8 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Our framework models dyadic interactions by integrating an autoregressive diffusion model for low-level motion generation with an LLM-based agentic system, Social Agent, for nonverbal behavior analysis. This system continuously analyzes and refines nonverbal behavior cues, dynamically guiding the diffusion model to generate natural interpersonal behaviors such as spatial positioning, gaze contact, and gesture synchrony.
  • Figure 2: The Social Agent System consists of two key modules: the Scene Designer, which analyzes dialogue content to determine the initial proxemic setup at the start of the generation process; and the Dynamic Controller, which predicts upcoming interactions for each generation round using multiple predictors. The predicted control signals are then converted into constraints that guide the low-level diffusion model, ensuring coherent and context-aware nonverbal behavior generation.
  • Figure 3: Dyadic nonverbal behaviors generated by our system. Left: Scene Designer predicts the initial proxemic setup. Right: Dynamic Controller’s signals with corresponding target words (red for gaze, blue for gesture imitation, and green for nodding). Motion trend lines show imitation patterns (blue: imitated character, red: imitator). The Scene Designer ensures scene-aware spatial arrangements, while the Dynamic Controller guides cohesive dyadic interactions.
  • Figure 4: Visualization of the Scene Designer Agent process workflow and results. The blue character is Character I, and the green character is Character II. The examples showcase the framework's scene analysis and understanding capabilities, illustrating how it designs realistic and contextually appropriate initial proxemic setups for different scenarios. This facilitates subsequent interaction control by the Dynamic Controller Agent module, ensuring more natural and context-aware interactions.
  • Figure 5: This example illustrates how the Spatial Relation Predictor conducts fine-grained spatial reasoning based solely on textual input. Red text in the input highlights the current spatial state of both characters. The 3D image on the right visualizes the input configuration but is not part of the model’s input. In the output, blue text emphasizes the model’s spatial reasoning process, such as the inferred direction and distance of Character I’s movement. This is a concise version of the agent’s output, preserving essential information.
  • ...and 8 more figures
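
The Figure 5 caption says the Spatial Relation Predictor reasons over spatial state given purely as text. As a rough illustration of what such a textual encoding might look like, the sketch below turns two ground-plane positions and a facing angle into one sentence. Everything here is an assumption for illustration: the function name `describe_spatial_state`, the (x, z) coordinate convention, the angle thresholds, the left/right naming, and the sentence wording are not taken from the paper.

```python
import math


def describe_spatial_state(pos_i, yaw_i_deg, pos_ii):
    """Summarize where Character II is relative to Character I, as text.

    Assumed convention (not from the paper): positions are (x, z) in metres
    on the ground plane, and yaw is in degrees from the +z axis toward +x.
    """
    dx = pos_ii[0] - pos_i[0]
    dz = pos_ii[1] - pos_i[1]
    distance = math.hypot(dx, dz)
    bearing = math.degrees(math.atan2(dx, dz))
    # Wrap into [-180, 180) so that 0 means "straight ahead".
    relative = (bearing - yaw_i_deg + 180.0) % 360.0 - 180.0
    if abs(relative) <= 30.0:
        side = "in front of"
    elif abs(relative) >= 150.0:
        side = "behind"
    elif relative > 0:
        # Assumption: a positive relative angle is labelled "right".
        side = "to the right of"
    else:
        side = "to the left of"
    return f"Character II is {distance:.1f} m away, {side} Character I."
```

A sentence like this could be placed in a prompt so that a language model receives distance and direction without any image input, which matches the caption's note that the 3D visualization is not part of the model's input.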