GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization

Zhangyu Jin, Maksim Siniukov, Deuksin Kwon, Ashutosh Chaubey, Mohammad Soleymani

Abstract

Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, they frequently suffer from the "regression-to-the-mean" problem in listener motions, collapsing into static faces, and lack the parameter space for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework for highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture that enables stable supervised learning. Second, to overcome kinematic stillness, we apply Group reward-Decoupled Policy Optimization (GDPO): by isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high-variance, expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations on the Seamless Interaction and DualTalk datasets demonstrate superior performance over existing baselines in long-term kinematic variance, visual expressivity, and semantic controllability.
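The abstract's core idea, isolating reward normalization per FLAME parameter group rather than normalizing one pooled reward, can be illustrated with a minimal sketch. This is not the paper's implementation: the group names, reward values, and the simple sum-of-advantages combination are illustrative assumptions.

```python
import numpy as np

def decoupled_advantages(rewards_by_group, eps=1e-8):
    """Group-decoupled advantage computation (illustrative sketch).

    Each FLAME parameter group's rewards are normalized independently
    across the rollout group, so a low-variance group (e.g. jaw) cannot
    dampen the learning signal of a high-variance group (e.g. head pose).
    The per-group advantages are then summed per rollout.
    """
    total = None
    for name, r in rewards_by_group.items():
        r = np.asarray(r, dtype=float)
        adv = (r - r.mean()) / (r.std() + eps)  # normalize within this group only
        total = adv if total is None else total + adv
    return total

# Hypothetical rewards for 4 sampled rollouts, split by FLAME parameter group.
rewards = {
    "expression": [0.2, 0.9, 0.4, 0.7],
    "jaw":        [0.5, 0.5, 0.6, 0.4],
    "head_pose":  [0.1, 0.8, 0.3, 0.9],
}
adv = decoupled_advantages(rewards)  # one advantage per rollout
```

Because each group is standardized before summing, the combined advantage has zero mean across the rollout group, matching the usual group-relative policy-gradient setup.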

Paper Structure

This paper contains 18 sections, 10 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: GDPO-Listener. Our framework generates expressive speaking and listening head reactions from multimodal dyadic conversational inputs. By utilizing an expanded FLAME parameter space, it naturally supports eye blinking and head nodding. Furthermore, it enables explicit semantic text control to ensure contextually appropriate responses and maintains stable dynamics during long-sequence inference.
  • Figure 2: Necessity of Our Method. We resolve three critical gaps in previous baselines. (Left) Older methods suffer from static mean-collapse, but we synthesize high-variance expressive reactions. (Middle) Baselines cannot produce complex motions, whereas we enable natural eye blinks and head nods. (Right) Prior models often have no semantic guidance, but our text control ensures contextually appropriate responses.
  • Figure 3: GDPO-Listener Architecture. Our framework has two training stages. (a) Supervised Learning. Multimodal inputs are encoded as prefix conditions, and an Auto-Regressive Flow Matching model iteratively predicts actor motion latents from noise and history via ODE sampling. (b) Reinforcement Learning. We then post-train the policy model via GDPO. We compute fine-grained, decoupled rewards for distinct FLAME parameters under SDE sampling to explicitly optimize expressiveness.
  • Figure 4: Qualitative Comparisons. Other methods exhibit low-expressiveness speaking and static listening; our method shows better lip sync and highly expressive reactions.
  • Figure 5: Advanced Generation Capabilities. (Top) Semantic text explicitly controls emotional states. (Middle) We sustain dynamic reactions during long sequences, avoiding baseline static decay. (Bottom) CFG scaling seamlessly modulates expressiveness intensity without retraining.
  • ...and 2 more figures