StyGazeTalk: Learning Stylized Generation of Gaze and Head Dynamics

Chengwei Shi, Chong Cao

TL;DR

StyGazeTalk addresses the lack of integrated gaze–head generation in speech-driven animation by treating gaze as a core dynamic modality and learning a continuous style space. It combines a pattern-informed generator with a contrastive style encoder and trains on HAGE, a high-fidelity multimodal dataset with synchronized audio, gaze, head pose, and expressions. The method achieves temporally coherent, style-consistent gaze–head motions and enables controllable style transfer, validated by objective metrics, qualitative results, and a user perception study. The work demonstrates that high-fidelity eye-tracking supervision is essential for realistic gaze dynamics and suggests broad potential for more expressive 3D avatars and human-agent interactions.

Abstract

Gaze and head movements play a central role in expressive 3D media, human-agent interaction, and immersive communication. Existing works often model facial components in isolation and lack mechanisms for generating personalized, style-aware gaze behaviors. We propose StyGazeTalk, a multimodal framework that synthesizes synchronized gaze-head dynamics with controllable styles. To support high-fidelity training, we construct HAGE, a high-precision multimodal dataset containing eye-tracking data, audio, head pose, and 3D facial parameters. Experiments show that our method produces temporally coherent, style-consistent gaze-head motions, enhancing realism in 3D face generation.

Paper Structure

This paper contains 23 sections, 3 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Overview of the proposed StyGazeTalk framework.
  • Figure 2: Overview of our method. (Left) At each time step t, the model takes audio $A_{t:t+M}$, past motion $X_{t-1:t+M}$, and a style code $s_{t-1}$, and predicts future motion $X_{t:t+N-1}$ using a conditional sequence generation model. (Right) The style code is extracted from past motion via a pretrained Transformer encoder, enabling style-aware, temporally coherent motion generation.
  • Figure 3: Structured gaze–head dynamics: fixations/saccades (yellow), eye–head coordination (gray), and speaker-specific style patterns.
  • Figure 4: Visual results of the ablation study of our method and a qualitative comparison with state-of-the-art methods.
  • Figure 5: t-SNE of style embeddings. Colors indicate sessions; solid = ground truth, transparent = predictions.
  • ...and 1 more figure
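
The generation loop described in the Figure 2 caption (audio window, past motion, and a style code extracted from past motion, used to predict the next N frames) can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the authors' implementation: `mean_style` and `toy_generator` are hypothetical stand-ins for the pretrained Transformer style encoder and the conditional sequence generation model, and the one-audio-value-per-frame layout, the `context` length, and all function and parameter names are assumptions.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # one motion frame, e.g. gaze + head pose values (layout assumed)


def mean_style(past: Sequence[Frame]) -> Frame:
    """Hypothetical style encoder: per-dimension mean of the past motion.
    The paper uses a pretrained Transformer encoder in this role."""
    dims = len(past[0])
    return [sum(f[d] for f in past) / len(past) for d in range(dims)]


def toy_generator(audio: Sequence[float], past: Sequence[Frame],
                  style: Frame, n: int) -> List[Frame]:
    """Hypothetical generator: drifts the last frame toward the style code,
    scaled by the mean absolute audio value of the current window."""
    last = list(past[-1])
    energy = sum(abs(a) for a in audio) / max(len(audio), 1)
    out: List[Frame] = []
    for _ in range(n):
        last = [x + 0.1 * energy * (s - x) for x, s in zip(last, style)]
        out.append(last)
    return out


def rollout(audio: Sequence[float], seed: Sequence[Frame], n_future: int,
            audio_window: int, context: int,
            generator: Callable = toy_generator,
            style_encoder: Callable = mean_style) -> List[Frame]:
    """Autoregressive rollout: at each step, re-extract the style code from
    the most recent motion, predict n_future frames, and append them."""
    motion = [list(f) for f in seed]
    t = 0
    while t < len(audio):
        past = motion[-context:]                  # most recent motion frames
        style = style_encoder(past)               # style code from past motion
        window = audio[t:t + audio_window]        # audio look-ahead window
        motion.extend(generator(window, past, style, n_future))
        t += n_future
    return motion[len(seed):]                     # generated frames only


seed_motion = [[0.0, 0.0], [1.0, 1.0]]
frames = rollout([0.5] * 10, seed_motion, n_future=5, audio_window=8, context=2)
print(len(frames), len(frames[0]))
```

Because the style code is recomputed from the model's own recent output at every step, the loop mirrors the caption's idea of conditioning each prediction on a style extracted from past motion; a real system would swap in the learned encoder and generator.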