StyGazeTalk: Learning Stylized Generation of Gaze and Head Dynamics
Chengwei Shi, Chong Cao
TL;DR
StyGazeTalk addresses the lack of integrated gaze–head generation in speech-driven animation by treating gaze as a core dynamic modality and learning a continuous style space. It combines a pattern-informed generator with a contrastive style encoder and trains on HAGE, a high-fidelity multimodal dataset with synchronized audio, gaze, head pose, and expressions. The method achieves temporally coherent, style-consistent gaze–head motions and enables controllable style transfer, validated by objective metrics, qualitative results, and a user perception study. The work demonstrates that high-fidelity eye-tracking supervision is essential for realistic gaze dynamics and suggests broad potential for more expressive 3D avatars and human-agent interactions.
Abstract
Gaze and head movements play a central role in expressive 3D media, human-agent interaction, and immersive communication. Existing methods often model facial components in isolation and lack mechanisms for generating personalized, style-aware gaze behaviors. We propose StyGazeTalk, a multimodal framework that synthesizes synchronized gaze-head dynamics with controllable styles. To support high-fidelity training, we construct HAGE, a high-precision multimodal dataset containing eye-tracking data, audio, head pose, and 3D facial parameters. Experiments show that our method produces temporally coherent, style-consistent gaze-head motions, enhancing realism in 3D face generation.
