3D Gaussian and Diffusion-Based Gaze Redirection

Abiram Panchalingam, Indu Bodala, Stuart Middleton

TL;DR

DiTGaze tackles high-fidelity, continuous gaze redirection for generating augmented training data. It integrates a Latent Diffusion Transformer (DiT) renderer into a two-stream 3D Gaussian Splatting (3DGS) framework and adds a weak gaze interpolation strategy and an orthogonality loss to disentangle gaze, pose, and expression. It achieves state-of-the-art gaze fidelity and identity preservation on ETH-XGaze and generalizes well across ColumbiaGaze, MPIIFaceGaze, and GazeCapture, reducing gaze error and improving perceptual metrics. The key contributions are (1) a Latent Diffusion Transformer for high-quality rendering in 3DGS, (2) a Weak Gaze Interpolation scheme for learning a smooth gaze manifold, and (3) an Orthogonality Constraint Loss that enforces disentanglement between control factors. The work enables more realistic synthetic training data for gaze estimators and demonstrates the practical potential of combining diffusion-based rendering with 3D-aware avatar models, while acknowledging trade-offs in computational cost and pose stability.

Abstract

High-fidelity gaze redirection is critical for generating augmented data to improve the generalization of gaze estimators. 3D Gaussian Splatting (3DGS) models like GazeGaussian represent the state of the art but can struggle with rendering subtle, continuous gaze shifts. In this paper, we propose DiT-Gaze, a framework that enhances 3D gaze redirection models using a novel combination of a Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss. DiT allows higher-fidelity image synthesis, while our weak supervision strategy, which uses synthetically generated intermediate gaze angles, provides a smooth manifold of gaze directions during training. The orthogonality constraint loss mathematically enforces the disentanglement of internal representations for gaze, head pose, and expression. Comprehensive experiments show that DiT-Gaze sets a new state of the art in both perceptual quality and redirection accuracy, reducing the state-of-the-art gaze error by 4.1% to 6.353 degrees and providing a superior method for creating synthetic training data. Our code and models will be made available for the research community to benchmark against.
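The two training-side ideas named in the abstract, intermediate gaze angles and an orthogonality constraint, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names `sample_intermediate_gaze` and `orthogonality_penalty` are hypothetical, the interpolation is plain linear blending of (pitch, yaw) labels, and the penalty is a mean squared cosine similarity between two feature batches. The exact formulations used by the authors are not given in this overview and may differ.

```python
import numpy as np


def sample_intermediate_gaze(gaze_a, gaze_b, rng):
    """Blend two (pitch, yaw) gaze labels, in radians, at a random ratio.

    Assumption: linear interpolation in angle space with t ~ U(0, 1).
    Returns the interpolated gaze and the blend ratio t.
    """
    t = rng.uniform(0.0, 1.0)
    gaze_a = np.asarray(gaze_a, dtype=np.float64)
    gaze_b = np.asarray(gaze_b, dtype=np.float64)
    return (1.0 - t) * gaze_a + t * gaze_b, t


def orthogonality_penalty(feat_a, feat_b, eps=1e-8):
    """Mean squared cosine similarity between paired feature rows.

    feat_a, feat_b: arrays of shape (batch, dim), e.g. hypothetical gaze
    and pose embeddings. The value is 0 when every pair is orthogonal
    and approaches 1 when every pair is parallel.
    """
    a = feat_a / (np.linalg.norm(feat_a, axis=1, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=1, keepdims=True) + eps)
    cos = np.sum(a * b, axis=1)
    return float(np.mean(cos ** 2))
```

In a training loop, a penalty of this kind would typically be added to the reconstruction losses with a small weight, so that features driven by gaze carry as little information as possible about pose or expression.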

Paper Structure

This paper contains 19 sections, 15 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Gaze redirection: Given an input image and target gaze, DiTGaze utilizes a 3DGS model and a DiT renderer to generate high-fidelity head images with accurate gaze redirection.
  • Figure 2: Pipeline of DiTGaze. We initialize face and eye Gaussians from a pre-trained neutral mesh. The Intermediate Gaze Sampler (Ours) generates inputs for the Eye Rotation Field ($\Delta q$), while pose and expression codes drive the Face Deform Field ($\Delta x$). To ensure disentanglement, our Orthogonality Loss ($L_{\text{Ortho}}$) is applied to both fields. The resulting Gaussians are splatted into feature maps, concatenated, and fed into our Latent DiT Renderer (Ours), which uses AdaLN conditioning to synthesize the final image.
  • Figure 3: Within-dataset visualization: Head images are generated from the ETH-XGaze test set comparing DiTGaze (Ours) against GazeNeRF and GazeGaussian. Baseline results are reproduced directly from GazeGaussian's publication. Our DiT-based model not only preserves identity and matches the target gaze, but also generates superior, high-fidelity facial details, such as individual strands of hair. In contrast, GazeNeRF suffers from significant identity loss and blur, while the GazeGaussian baseline produces softer, less realistic textures.
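The Figure 2 caption mentions that the Latent DiT Renderer uses AdaLN conditioning. As a rough sketch of what adaptive layer normalization does in general, the NumPy snippet below normalizes each token and then applies a scale and shift predicted from a conditioning vector. The names `layer_norm` and `adaln_modulate`, the single linear projections `w_scale` and `w_shift`, and the tensor shapes are illustrative assumptions, not the authors' architecture.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature axis, without learned affine."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def adaln_modulate(tokens, cond, w_scale, w_shift):
    """Adaptive LayerNorm: scale and shift are regressed from `cond`.

    tokens:  (num_tokens, dim) latent tokens
    cond:    (cond_dim,) conditioning vector (hypothetical, e.g. a
             timestep or control embedding)
    w_scale, w_shift: (cond_dim, dim) projection weights (hypothetical)
    """
    scale = cond @ w_scale  # (dim,)
    shift = cond @ w_shift  # (dim,)
    return layer_norm(tokens) * (1.0 + scale) + shift
```

With zero projection weights the layer reduces to a plain layer norm, which is why DiT-style models commonly initialize the modulation near zero so that conditioning is learned gradually.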