TalkingEyes: Pluralistic Speech-Driven 3D Eye Gaze Animation
Yixiang Zhuang, Chunshan Ma, Yao Cheng, Xuan Cheng, Jing Liao, Juncong Lin
TL;DR
TalkingEyes addresses the neglected problem of generating expressive 3D eye gaze from speech. It builds TKED, a large-scale audio-gaze dataset, and a dual latent space model that maps speech to head motion via a continuous VAE and to eye gaze via a discrete VQVAE. A temporal autoregressive cross-modal Transformer translates speech embeddings into compatible head and gaze codes, enabling diverse, natural eye gaze synchronized with speech. Eye blinks are generated from a data-driven statistic, and mouth motion is derived from Learn2Talk to complete a holistic 3D avatar. The approach yields higher motion diversity and stronger audio-motion alignment than baselines, and perceptual user studies favor its realism. LightGazeFit, the accompanying 3D eye gaze fitting method, is competitive on low-resolution input. Together, these contributions advance speech-aligned, pluralistic 3D talking avatars suitable for immersive human-computer interaction and virtual characters.
Abstract
Although significant progress has recently been made in speech-driven 3D facial animation, the speech-driven animation of an indispensable facial component, eye gaze, has been overlooked. This is primarily due to the weak correlation between speech and eye gaze, as well as the scarcity of audio-gaze data, which together make it very challenging to generate 3D eye gaze motion from speech alone. In this paper, we propose a novel data-driven method that generates diverse 3D eye gaze motions in harmony with the speech. To achieve this, we first construct an audio-gaze dataset that contains about 14 hours of audio-mesh sequences featuring high-quality eye gaze motion, head motion and facial motion simultaneously. The motion data is acquired by performing lightweight eye gaze fitting and face reconstruction on videos from existing audio-visual datasets. We then tailor a novel speech-to-motion translation framework in which the head motions and eye gaze motions are jointly generated from speech but are modeled in two separate latent spaces. This design stems from the physiological knowledge that the rotation range of the eyeballs is smaller than that of the head. By mapping the speech embedding into the two latent spaces, the difficulty of modeling the weak correlation between speech and non-verbal motion is attenuated. Finally, our TalkingEyes, integrated with a speech-driven 3D facial motion generator, can synthesize eye gaze motion, eye blinks, head motion and facial motion collectively from speech. Extensive quantitative and qualitative evaluations demonstrate the superiority of the proposed method in generating diverse and natural 3D eye gaze motions from speech. The project page of this paper is: https://lkjkjoiuiu.github.io/TalkingEyes_Home/
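
The difference between the two latent spaces can be illustrated with a small sketch. It is not the authors' implementation: the function names, the 2-D toy vectors and the four-entry codebook are all hypothetical, chosen only to show how a continuous VAE-style latent is sampled while a discrete VQVAE-style latent is snapped to its nearest codebook entry.

```python
import math
import random


def sample_continuous_latent(mean, log_var, rng):
    """VAE-style reparameterized sample: z = mean + sigma * eps, eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


def quantize(vector, codebook):
    """VQ-style lookup: return (index, code) of the nearest codebook entry."""
    def squared_distance(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))

    index = min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))
    return index, codebook[index]


rng = random.Random(0)

# Continuous latent (hypothetical stand-in for the head-motion space):
# any point near the predicted mean is a valid sample.
head_latent = sample_continuous_latent([0.0, 1.0], [-2.0, -2.0], rng)

# Discrete latent (hypothetical stand-in for the eye-gaze space):
# the encoder output is replaced by one of a finite set of codes.
gaze_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gaze_index, gaze_code = quantize([0.9, 0.2], gaze_codebook)
```

In this toy setting the query `[0.9, 0.2]` is mapped to codebook index 1, so a downstream model would only need to predict an index rather than a continuous value.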
