ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation

Xiangyu Kong, Xiaoyu Jin, Yihan Pan, Haoqin Sun, Hengde Zhu, Xiaoming Xu, Xiaoming Wei, Lu Liu, Siyang Song

Abstract

In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head video that emulates such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., the human user's behaviors and pre-defined audio for the avatar) within a short temporal window, jointly driving generation of the avatar's audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that facilitates contextual understanding of both behavior-grounded dynamics and linguistic-driven affective semantics to promote contextual appropriateness and emotional rationality of synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module that preserves self-audio-driven lip articulation while adaptively integrating user contextual behavioral cues for non-lip facial regions, complemented by a two-stage training paradigm, to jointly enhance lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of the proposed components and ECHO's superior IHG performance.

Paper Structure (14 sections, 12 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of ECHO and pipeline comparison with existing IHG approaches. (a) Approaches (zhu2025infp, cai2025towards, sun2025streamavatar) that rely exclusively on dual-track audio signals for avatar FB generation; (b) Approaches (peng2025dualtalk, ki2026avatar) that condition avatar FB generation on short-range-only context modeling via role-agnostic fusion; (c) Our ECHO, which performs long-range contextual understanding and applies a region-wise decoupled cross-attention mechanism, achieving contextually appropriate and precisely lip-synchronized avatar FBs.
  • Figure 2: The pipeline of our proposed ECHO. The proposed LCU component starts by extracting the user's long-range audio-visual behavioral features ($\mathbf{A}^{1:T}_{\text{usr}}$ and $\mathbf{V}^{1:T}_{\text{usr}}$) to perform low-level perception encoding along with high-level behavioral context understanding (resulting in the perception-understanding representation $\mathbf{f}_{\text{pu}}$). Then, the linguistic dialogue context is leveraged to infer the avatar's descriptive emotional state (resulting in the embeddings $\mathbf{c}^{\text{emo}}$). Subsequently, the two obtained representations are used as conditioning inputs for the proposed block-wise SDCM module to guide avatar FB generation. Extended pipeline details of the avatar generator are provided in Appendix 2.1.
  • Figure 3: Qualitative comparison with open-sourced SOTA methods on IHG. The two dyadic scenarios above, covering active listening and speaking, show that our proposed ECHO generates avatar heads with more context-consistent and emotionally appropriate facial behaviors, and achieves lip articulation with better audio–visual alignment.
  • Figure 4: Ablation study: (a) Top-left: Ablation on low-level perception encoding by leveraging long-range visual cues; (b) Bottom-left: Ablation on high-level behavior-grounded context understanding; (c) Top-right: Ablation on SDCM module; (d) Bottom-right: Ablation on linguistic-driven affective understanding.
  • Figure 5: Qualitative comparison with state-of-the-art Interactive Head Generation (IHG) approaches, including INFP zhu2025infp, ARIG guo2025arig, and StreamAvatar sun2025streamavatar. As none of the compared SOTA methods is open-sourced, we follow their established visual comparison protocols and evaluate on two samples from the DyConv dataset introduced by INFP zhu2025infp.
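
The region-wise decoupling described in the abstract and in Figure 1(c) can be illustrated with a toy sketch. This is not the authors' SDCM implementation: it assumes a per-token lip mask in [0, 1], omits learned projections, multi-head attention, and the block-wise design, and every name in it (`cross_attention`, `decoupled_modulation`, `lip_mask`, `avatar_audio`, `user_context`) is hypothetical. It only shows the general idea of running two separate cross-attention branches and letting a spatial mask decide which branch updates each facial token.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    Simplifying assumption: `context` vectors serve as both keys and
    values, with no learned projection matrices.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return outputs


def decoupled_modulation(tokens, lip_mask, avatar_audio, user_context):
    """Blend two independent cross-attention branches per spatial token.

    lip_mask[i] == 1.0 means token i is updated only from the avatar's
    own audio; 0.0 means it is updated only from the user's context.
    Intermediate values blend both (an assumed form of soft routing).
    """
    from_audio = cross_attention(tokens, avatar_audio)
    from_user = cross_attention(tokens, user_context)
    result = []
    for tok, m, a, u in zip(tokens, lip_mask, from_audio, from_user):
        # Residual update: original token plus the mask-weighted branch mix.
        result.append(
            [t + m * ai + (1.0 - m) * ui for t, ai, ui in zip(tok, a, u)]
        )
    return result


# Toy inputs: token 0 is a "lip" token, token 1 is a "non-lip" token.
tokens = [[1.0, 0.0], [0.0, 1.0]]
lip_mask = [1.0, 0.0]
avatar_audio = [[0.5, 0.5], [1.0, -1.0]]
out = decoupled_modulation(tokens, lip_mask, avatar_audio, [[0.2, 0.1]])
```

In this sketch, changing `user_context` leaves the lip token's output untouched, which mirrors the stated goal of keeping lip articulation driven by the avatar's own audio while non-lip regions react to the user.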