TalkingEyes: Pluralistic Speech-Driven 3D Eye Gaze Animation

Yixiang Zhuang, Chunshan Ma, Yao Cheng, Xuan Cheng, Jing Liao, Juncong Lin

TL;DR

TalkingEyes addresses the neglected problem of generating expressive 3D eye gaze from speech. It contributes TKED, a large-scale audio-gaze dataset, and a dual latent space model that maps speech to head motion via a continuous VAE and to eye gaze via a discrete VQVAE. A temporal autoregressive cross-modal Transformer translates speech embeddings into compatible head and gaze codes, enabling diverse, natural eye gaze synchronized with speech. Eye blinks are generated from a data-driven statistic, and mouth motion is derived from Learn2Talk, completing a holistic 3D avatar. The approach yields higher motion diversity and stronger audio-motion alignment than baselines, perceptual user studies favor its realism, and LightGazeFit provides competitive, low-resolution 3D eye gaze fitting. Together, these contributions advance phonetically aligned, pluralistic 3D talking avatars suitable for immersive human-computer interaction and virtual characters.

Abstract

Although speech-driven 3D facial animation has progressed significantly in recent years, the speech-driven animation of an indispensable facial component, eye gaze, has been overlooked. This is primarily due to the weak correlation between speech and eye gaze and to the scarcity of audio-gaze data, which together make it very challenging to generate 3D eye gaze motion from speech alone. In this paper, we propose a novel data-driven method that generates diverse 3D eye gaze motions in harmony with the speech. To achieve this, we first construct an audio-gaze dataset containing about 14 hours of audio-mesh sequences that feature high-quality eye gaze motion, head motion, and facial motion simultaneously. The motion data is acquired by performing lightweight eye gaze fitting and face reconstruction on videos from existing audio-visual datasets. We then tailor a novel speech-to-motion translation framework in which head motions and eye gaze motions are jointly generated from speech but modeled in two separate latent spaces. This design stems from the physiological knowledge that the rotation range of the eyeballs is smaller than that of the head. Mapping the speech embedding into the two latent spaces attenuates the difficulty of modeling the weak correlation between speech and non-verbal motion. Finally, TalkingEyes, integrated with a speech-driven 3D facial motion generator, can synthesize eye gaze motion, eye blinks, head motion, and facial motion collectively from speech. Extensive quantitative and qualitative evaluations demonstrate the superiority of the proposed method in generating diverse and natural 3D eye gaze motions from speech. The project page of this paper is: https://lkjkjoiuiu.github.io/TalkingEyes_Home/
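To make the distinction between the two latent spaces concrete, here is a minimal sketch in plain Python of the two generic mechanisms involved: sampling a continuous latent as in a VAE, and snapping a vector to its nearest codebook entry as in a VQVAE. It is not the authors' implementation. The function names, the 2-D latent size, the 4-entry codebook, and all numeric values are assumptions chosen only for illustration.

```python
import math
import random


def sample_continuous_latent(mu, log_var, rng):
    """VAE-style reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def quantize(vector, codebook):
    """VQVAE-style lookup: return (index, entry) of the nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


rng = random.Random(0)

# Hypothetical 2-D head-motion latent (continuous space): every draw differs.
head_mu, head_log_var = [0.1, -0.3], [-2.0, -2.0]
head_z = sample_continuous_latent(head_mu, head_log_var, rng)

# Hypothetical 4-entry gaze codebook (discrete space): output is always an entry.
gaze_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gaze_index, gaze_code = quantize([0.9, 0.2], gaze_codebook)

print("head latent:", head_z)
print("gaze code index:", gaze_index, "entry:", gaze_code)
```

In this toy setting the continuous latent can take any real value near its mean, while the quantized latent is restricted to a finite set of learned entries. How the paper's model sizes, trains, and decodes these spaces is not specified in the text above.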

Paper Structure

This paper contains 38 sections, 15 equations, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: Given speech audio, TalkingEyes can synthesize pluralistic eye gaze motions, eye blinks, head motions, and facial motions, and can further be used to drive a 3D Gaussian Splatting (3DGS) [3DGSTOG2023] based head avatar [GaussianAvatarsCVPR2024].
  • Figure 2: The videos and their reconstructed 3D mesh sequences in TKED.
  • Figure 3: The pipeline of the dataset construction.
  • Figure 4: The 3D eyeball model used in our method (left) and its relative position and orientation to the FLAME head model (right).
  • Figure 5: The pipeline of TalkingEyes. Training comprises two stages: (a) learning the discrete latent space for eye gaze motions by pre-training the VQVAE, and (b) jointly training the VAE-based speech-to-head translation and the VQVAE-based speech-to-gaze translation in an autoregressive manner. At inference, the ground-truth head motions and the head motion encoder of the VAE are removed.
  • ...and 4 more figures