ProAct: A Dual-System Framework for Proactive Embodied Social Agents

Zeyi Zhang, Zixi Kang, Ruijie Zhao, Yusen Feng, Biao Jiang, Libin Liu

TL;DR

In real-world interaction user studies, participants and observers consistently prefer ProAct over reactive variants in perceived proactivity, social presence, and overall engagement, demonstrating the benefits of dual-system proactive control for embodied social interaction.

Abstract

Embodied social agents have recently advanced in generating synchronized speech and gestures. However, most interactive systems remain fundamentally reactive, responding only to current sensory inputs within a short temporal window. Proactive social behavior, in contrast, requires deliberation over accumulated context and intent inference, which conflicts with the strict latency budget of real-time interaction. We present ProAct, a dual-system framework that reconciles this time-scale conflict by decoupling a low-latency Behavioral System for streaming multimodal interaction from a slower Cognitive System which performs long-horizon social reasoning and produces high-level proactive intentions. To translate deliberative intentions into continuous non-verbal behaviors without disrupting fluency, we introduce a streaming flow-matching model conditioned on intentions via ControlNet. This mechanism supports asynchronous intention injection, enabling seamless transitions between reactive and proactive gestures within a single motion stream. We deploy ProAct on a physical humanoid robot and evaluate both motion quality and interactive effectiveness. In real-world interaction user studies, participants and observers consistently prefer ProAct over reactive variants in perceived proactivity, social presence, and overall engagement, demonstrating the benefits of dual-system proactive control for embodied social interaction.
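
The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `mailbox` queue, the string-valued "frames", and the timing constants are all hypothetical, chosen only to show a fast loop that never blocks on a slow one and picks up an intention whenever it arrives.

```python
import queue
import threading
import time


def cognitive_system(observations, mailbox, think_time=0.1):
    """Slow loop (hypothetical): deliberate, then post one intention."""
    time.sleep(think_time)  # stands in for long-horizon reasoning
    mailbox.put(f"proactive:{observations[-1]}")


def behavioral_system(mailbox, n_frames=30, frame_time=0.01):
    """Fast loop (hypothetical): emit one frame per tick without blocking."""
    intention = None
    frames = []
    for _ in range(n_frames):
        try:
            # Non-blocking read: a missing intention never stalls the stream.
            intention = mailbox.get_nowait()
        except queue.Empty:
            pass
        frames.append("reactive" if intention is None else intention)
        time.sleep(frame_time)  # stands in for the real-time frame budget
    return frames


def run_demo():
    mailbox = queue.Queue()
    thinker = threading.Thread(
        target=cognitive_system, args=(["user_looks_away"], mailbox)
    )
    thinker.start()
    frames = behavioral_system(mailbox)
    thinker.join()
    return frames


if __name__ == "__main__":
    frames = run_demo()
    print(frames[0], "->", frames[-1])
```

In this toy setup the stream starts with reactive frames and switches to the injected intention partway through, without the fast loop ever waiting. The actual system conditions a streaming flow-matching motion model on the intention rather than swapping labels.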

Paper Structure (44 sections, 3 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Architecture of the Cognitive System. The system operates in continuous cycles, with each cycle consisting of three stages: (a) collecting incremental multimodal inputs; (b) parallel execution of Context Encoder (memory compression) and Behavior Planner (proactive behavior prediction); (c) injecting behavior plans into the Behavioral System via different channels.
  • Figure 2: Demonstration of ProAct during a poster explanation task. The figure captures a continuous real-time session where the agent dynamically adapts its behavior based on visual and auditory triggers. Unlike passive systems, ProAct proactively captures the user's attention (top-left), orients to the user (top-right), guides the conversation to the poster (bottom-left), and actively clarifies a misconception (bottom-right). The top header specifies the scenario setup, while the beige panels visualize the Cognitive System, mapping real-time observations and triggers to specific behavioral plans. The generated gestures correspond to these intentions, where the red arrows in the overlays indicate the velocity and direction of each movement.
  • Figure 3: Demonstration of ProAct during a storytelling task. In this scenario, ProAct is tasked with maintaining user engagement while narrating a story. The agent detects a loss of engagement when the user disengages to interact with a smartphone (left panel) and actively intervenes to regain attention by instructing the user to set the device aside (right panel). Together with Figure 2, these results demonstrate the agent's capability to handle diverse social scenarios.
  • Figure 4: Participant-based user study results. We compare the Full System (setting d) against w/o Cognitive System (setting b) across six experiential metrics. Each participant interacted with both systems in randomized order and rated them on a 5-point Likert scale. Error bars show standard deviations. Asterisks mark statistically significant differences ($p < 0.05$). The Full System demonstrates significant improvements in Active Agency, Perceived Presence, Interaction Comfort, and Willingness to Re-engage, while maintaining comparable Motion Naturalness and Response Timeliness, confirming that proactive capabilities enhance user experience without compromising real-time performance.
  • Figure 5: Qualitative comparison of emotional support strategies. We compare a baseline reactive agent (top row) against ProAct (bottom row) in a scenario where the user exhibits visible distress. The baseline agent remains passive, treating the interaction as a standard Q&A task and offering analytical solutions ("What is the core issue?"), which fails to alleviate the user's agitation. In contrast, ProAct utilizes its Cognitive System to interpret non-verbal cues (e.g., pacing, sighing) as a trigger for intervention. As visualized in the beige panels, the agent actively approaches the user to inquire about their state and prioritizes emotional de-escalation through soothing gestures and empathetic dialogue ("I will always be here for you"), successfully improving the user's emotional state. Red arrows indicate the velocity and trajectory of the generated comforting movements.
  • ...and 3 more figures
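
The three-stage cycle described in the Figure 1 caption (collect inputs, run the Context Encoder and Behavior Planner in parallel, inject plans via channels) might be sketched as follows. Everything here is an illustrative assumption: the function bodies, the fixed memory window of four items, the trigger `"user_looks_away"`, and the channel names `"speech"` and `"gesture"` do not come from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_context(memory, new_inputs):
    """Hypothetical Context Encoder: keep a bounded summary of the history."""
    return (memory + new_inputs)[-4:]


def plan_behavior(memory, new_inputs):
    """Hypothetical Behavior Planner: map a trigger to per-channel plans."""
    if "user_looks_away" in new_inputs:
        return {"speech": "ask_question", "gesture": "wave"}
    return {}


def cognitive_cycle(memory, new_inputs, channels):
    """One cycle: (a) inputs are passed in, (b) parallel stage, (c) injection."""
    # (b) Both components read the same snapshot and run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        memory_future = pool.submit(encode_context, memory, new_inputs)
        plan_future = pool.submit(plan_behavior, memory, new_inputs)
        new_memory = memory_future.result()
        plan = plan_future.result()
    # (c) Route each planned action to its (hypothetical) channel.
    for name, action in plan.items():
        channels.setdefault(name, []).append(action)
    return new_memory


if __name__ == "__main__":
    memory, channels = [], {}
    for inputs in (["user_speaks"], ["user_looks_away"]):
        memory = cognitive_cycle(memory, inputs, channels)
    print(memory, channels)
```

Running the planner and encoder on the same input snapshot keeps memory compression off the critical path of plan generation, which is one plausible reading of why the caption describes them as parallel.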