3D Gaussian and Diffusion-Based Gaze Redirection
Abiram Panchalingam, Indu Bodala, Stuart Middleton
TL;DR
DiT-Gaze tackles high-fidelity, continuous gaze redirection for generating augmented training data. It integrates a Latent Diffusion Transformer (DiT) renderer into a two-stream 3D Gaussian Splatting (3DGS) framework and adds a weak gaze interpolation strategy and an orthogonality loss to disentangle gaze, head pose, and expression. It achieves state-of-the-art gaze fidelity and identity preservation on ETH-XGaze and generalizes well to ColumbiaGaze, MPIIFaceGaze, and GazeCapture, reducing gaze error and improving perceptual metrics. The key contributions are (1) a Latent Diffusion Transformer for high-quality rendering in 3DGS, (2) a weak gaze interpolation scheme for learning a smooth gaze manifold, and (3) an orthogonality constraint loss that enforces disentanglement between control factors. The work enables more realistic synthetic training data for gaze estimators and demonstrates the practical potential of combining diffusion-based rendering with 3D-aware avatar models, while acknowledging trade-offs in computational cost and pose stability.
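To make the weak gaze interpolation idea concrete, the sketch below generates intermediate gaze labels between a source and a target direction. It is an illustration under stated assumptions, not the authors' implementation: the `(pitch, yaw)` representation in radians, the linear interpolation rule, and the names `Gaze` and `interpolate_gaze` are all hypothetical.

```python
from typing import List, Tuple

# Assumed convention (not from the paper): gaze is (pitch, yaw) in radians.
Gaze = Tuple[float, float]


def interpolate_gaze(source: Gaze, target: Gaze, num_steps: int) -> List[Gaze]:
    """Return num_steps gaze labels evenly spaced strictly between source and target.

    Linear interpolation is an assumption made for illustration; the paper only
    states that intermediate gaze angles are generated synthetically.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    labels: List[Gaze] = []
    for i in range(1, num_steps + 1):
        t = i / (num_steps + 1)  # interpolation weight in (0, 1)
        pitch = (1.0 - t) * source[0] + t * target[0]
        yaw = (1.0 - t) * source[1] + t * target[1]
        labels.append((pitch, yaw))
    return labels


if __name__ == "__main__":
    # Three weak labels between a frontal gaze and an up-and-left gaze.
    for pitch, yaw in interpolate_gaze((0.0, 0.0), (0.4, -0.2), num_steps=3):
        print(f"pitch={pitch:.3f} yaw={yaw:.3f}")
```

In a training loop, such intermediate labels would serve as weak targets for redirected renders that have no ground-truth image, which is one plausible way to encourage a smooth gaze manifold.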
Abstract
High-fidelity gaze redirection is critical for generating augmented data that improves the generalization of gaze estimators. 3D Gaussian Splatting (3DGS) models such as GazeGaussian represent the state of the art but can struggle to render subtle, continuous gaze shifts. In this paper, we propose DiT-Gaze, a framework that enhances 3D gaze redirection models with a novel combination of a Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss. The DiT enables higher-fidelity image synthesis, while our weak supervision strategy, which uses synthetically generated intermediate gaze angles, provides a smooth manifold of gaze directions during training. The orthogonality constraint loss mathematically enforces the disentanglement of the internal representations for gaze, head pose, and expression. Comprehensive experiments show that DiT-Gaze sets a new state of the art in both perceptual quality and redirection accuracy, reducing the previous best gaze error by 4.1%, to 6.353 degrees, and providing a superior method for creating synthetic training data. Our code and models will be made available for the research community to benchmark against.
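One common way to express an orthogonality constraint between factor representations is to penalize the squared cosine similarity of every pair of embeddings. The sketch below follows that pattern for gaze, head pose, and expression vectors. It is an assumption made for illustration: the abstract does not give the exact formulation, and the function names and the plain-list embedding format are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def orthogonality_loss(factors: Dict[str, Sequence[float]]) -> float:
    """Mean squared cosine similarity over all pairs of factor embeddings.

    The loss is 0 when every pair is orthogonal and 1 when every pair is
    parallel. This specific form is an assumption, not the paper's definition.
    """
    pairs = list(combinations(factors.values(), 2))
    if not pairs:
        raise ValueError("need at least two factor embeddings")
    return sum(cosine_similarity(a, b) ** 2 for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    disentangled = {"gaze": [1.0, 0.0, 0.0], "pose": [0.0, 1.0, 0.0], "expr": [0.0, 0.0, 1.0]}
    entangled = {"gaze": [1.0, 0.0], "pose": [1.0, 0.0], "expr": [0.0, 1.0]}
    print(orthogonality_loss(disentangled))  # all pairs orthogonal
    print(orthogonality_loss(entangled))     # gaze and pose coincide
```

Squaring the similarity penalizes both aligned and anti-aligned embeddings, so minimizing the term pushes each pair toward a right angle rather than merely apart.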
