RTGaze: Real-Time 3D-Aware Gaze Redirection from a Single Image

Hengfei Wang, Zhongqun Zhang, Yihua Cheng, Hyung Jin Chang

TL;DR

RTGaze addresses real-time, 3D-aware gaze redirection from a single image. It learns a gaze-controllable facial representation from images and gaze prompts, and distills 3D priors from a pretrained 3D portrait generator into a lightweight triplane-based renderer. The method uses a dual-encoder backbone (high- and low-frequency features) with cross-attention-based gaze prompt injection, and is trained with depth-prior distillation under the objective $\mathcal{L} = \alpha \mathcal{L}_{\mathcal{R}} + \beta \mathcal{L}_{\mathcal{D}} + \gamma \mathcal{L}_{\mathcal{P}}$ together with a dedicated eye-region reconstruction loss. Empirically, RTGaze achieves state-of-the-art efficiency and competitive or superior image quality and gaze accuracy on ETH-XGaze, ColumbiaGaze, and MPIIFaceGaze, with processing times of around $61\,\mathrm{ms}$ per image and no test-time GAN inversion. Fast inference, 3D consistency, and high-fidelity gaze control that preserves identity and photorealism make the method relevant to real-time digital humans, AR/VR, and broadcast applications. The key innovation is the separation of appearance and geometry through dual encoders, combined with gaze prompt cross-attention and the distillation of 3D priors into a lightweight rendering module.
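The weighted objective in the summary can be illustrated with a minimal sketch. The summary does not expand the subscripts, so reading $\mathcal{L}_{\mathcal{R}}$, $\mathcal{L}_{\mathcal{D}}$, and $\mathcal{L}_{\mathcal{P}}$ as reconstruction, distillation, and perceptual terms is an assumption. The function names `masked_l1` and `total_loss`, the choice of an L1 penalty for the eye region, and all numeric values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def masked_l1(pred, target, mask):
    """Mean absolute error over pixels where mask is 1 (e.g. an eye region)."""
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0


def total_loss(l_r, l_d, l_p, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum L = alpha * L_R + beta * L_D + gamma * L_P."""
    return alpha * l_r + beta * l_d + gamma * l_p


# Toy 2x3 "images"; the mask marks a hypothetical eye region (middle column).
pred = [[0.5, 0.25, 0.0], [1.0, 0.5, 0.0]]
target = [[0.5, 0.75, 0.0], [0.0, 0.5, 1.0]]
mask = [[0, 1, 0], [0, 1, 0]]

# Only the two masked pixels contribute: (|0.25 - 0.75| + |0.5 - 0.5|) / 2
eye_loss = masked_l1(pred, target, mask)

# Placeholder loss values and weights, not taken from the paper.
loss = total_loss(l_r=0.5, l_d=0.25, l_p=0.125, alpha=1.0, beta=2.0, gamma=4.0)

print(eye_loss, loss)
```

Errors outside the mask (for example the large mismatch at the start of the second row) do not affect `eye_loss`, which is the point of a mask-guided constraint. How the eye-region term is combined with the three weighted terms is not stated in the summary, so the sketch keeps them separate.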

Abstract

Gaze redirection methods aim to generate realistic human face images with controllable eye movement. However, recent methods often struggle with 3D consistency, efficiency, or quality, limiting their practical applications. In this work, we propose RTGaze, a real-time and high-quality gaze redirection method. Our approach learns a gaze-controllable facial representation from face images and gaze prompts, then decodes this representation via neural rendering for gaze redirection. Additionally, we distill face geometric priors from a pretrained 3D portrait generator to enhance generation quality. We evaluate RTGaze both qualitatively and quantitatively, demonstrating state-of-the-art performance in efficiency, redirection accuracy, and image quality across multiple datasets. Our system achieves real-time, 3D-aware gaze redirection with a feedforward network (~0.06 sec/image), making it 800x faster than the previous state-of-the-art 3D-aware methods.
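As a quick consistency check on the speed claim, the sketch below works out what an 800x speedup over roughly 0.06 s per image implies for the baseline. It uses only the two numbers quoted in the abstract; the variable names are illustrative, and the result is an implied figure, not a measurement.

```python
# Per-image latency quoted for RTGaze (~0.06 s), kept in integer milliseconds
# to avoid floating-point rounding in the multiplication.
rtgaze_ms = 60
speedup = 800  # factor quoted in the abstract

baseline_ms = rtgaze_ms * speedup   # implied per-image time of the slower method
baseline_s = baseline_ms / 1000     # same value in seconds
fps = 1000 / rtgaze_ms              # implied RTGaze throughput in frames per second

print(f"implied baseline: {baseline_s:.0f} s per image")
print(f"implied RTGaze throughput: {fps:.1f} frames per second")
```

The implied baseline of about 48 s per image is in line with the "approximately one minute" inference time that the Figure 1 caption attributes to GazeNeRF.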

Paper Structure

This paper contains 19 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: 3D-aware gaze redirection results from our proposed RTGaze, which generates photo-realistic face images under novel gazes and views with good 3D consistency in real time. Compared to the state-of-the-art 3D-aware gaze redirection method GazeNeRF [ruzzi2022gazenerf], which requires approximately one minute during inference, our approach achieves real-time performance at $61\,\mathrm{ms}$ while maintaining superior image quality.
  • Figure 2: During training, our model takes three inputs: gaze prompts, source images, and frontal images. It consists of a gaze-controllable facial representation learning module and a face geometric prior distillation module. First, the model extracts high-frequency and low-frequency features from the source images and injects the gaze prompt into the high-frequency features. The final representation is a fusion of the gaze-injected features and the low-frequency features. This combined representation is fed into a triplane decoder, which generates a 3D face representation in the form of a triplane. The triplane is then used to render the final gaze-redirected image. The target image provides a mask-guided 2D constraint on the eye region. Additionally, we distill a 3D face geometry prior from a pre-trained 3D portrait generation model: we compute depth images from both the pre-trained model and our model, and apply a distillation loss between them.
  • Figure 3: Qualitative comparisons on the ETH-XGaze dataset. The background is removed by applying face masks. The images generated by RTGaze are photo-realistic and rich in detail. ST-ED [zheng2020self] struggles to preserve identity information and retains the unmasked green background, which does not appear in the outputs of the 3D-based methods. HeadNeRF [hong2022headnerf] and GazeNeRF [ruzzi2022gazenerf] lose facial details.
  • Figure 4: Visualization of generated results under novel views and gazes. Our model generates 3D faces with controllable gaze from a single input image, producing photorealistic face images over a wide range of head poses and gaze directions. The results under novel views show that the model maintains good 3D consistency, and the results under novel gazes show that it redirects gaze consistently. Please zoom in for better visualization.
  • Figure 5: Visualization of the ablation on fusion feature choices. Relying solely on low-frequency geometric features leads to blurriness and inaccurate gaze redirection. Conversely, combining high-frequency appearance features with the gaze embedding maintains facial structure while enabling efficient gaze redirection.
  • ...and 1 more figure
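The gaze prompt injection and feature fusion described in the Figure 2 caption can be sketched with a minimal single-head cross-attention in plain Python. This is an illustration under stated assumptions, not the authors' architecture: learned query, key, and value projections are omitted, the injection is written as a residual update, fusion is taken to be element-wise addition, and the names `cross_attention` and `inject_and_fuse` as well as all token values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out


def inject_and_fuse(high_freq, low_freq, gaze_tokens):
    """Inject gaze tokens into high-frequency tokens, then fuse with low-frequency ones.

    Assumptions: residual injection and additive fusion (not confirmed by the source).
    """
    attended = cross_attention(high_freq, gaze_tokens, gaze_tokens)
    injected = [[h + a for h, a in zip(h_tok, a_tok)]
                for h_tok, a_tok in zip(high_freq, attended)]
    return [[i + l for i, l in zip(i_tok, l_tok)]
            for i_tok, l_tok in zip(injected, low_freq)]


# Two feature tokens of dimension 2 per branch and two gaze prompt tokens (toy values).
high_freq = [[1.0, 0.0], [0.0, 1.0]]
low_freq = [[0.5, 0.5], [0.5, 0.5]]
gaze_tokens = [[1.0, 0.0], [0.0, 1.0]]

fused = inject_and_fuse(high_freq, low_freq, gaze_tokens)
print(fused)
```

In the full model the fused tokens would be passed to the triplane decoder; that stage, the renderer, and the depth distillation are outside the scope of this sketch.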