FovealNet: Advancing AI-Driven Gaze Tracking Solutions for Optimized Foveated Rendering System Performance in Virtual Reality
Wenxuan Liu, Monde Duinkharjav, Qi Sun, Sai Qian Zhang
TL;DR
FovealNet addresses the latency-accuracy challenge in gaze-tracked foveated rendering by integrating event-based eye region cropping, token-wise pruning within a ViT-based gaze tracker, and a performance-aware, multi-resolution training framework. The method directly ties gaze-tracking error to rendering latency, enabling dynamic runtime adaptation that reduces per-frame latency while preserving high perceptual quality, as evidenced by a ≥1.42× speedup and a 13% gain in perceptual quality. Key innovations include an efficient pupil-centered region-cropping algorithm, token packaging for selective attention, and a loss design that minimizes worst-case gaze errors, all validated on OpenEDS2020 and with perceptual metrics such as FovVideoVDP. The work demonstrates practical impact for VR/AR systems by enabling faster, more reliable gaze-contingent rendering across devices and rendering resolutions, with strong tail-error reductions and near-lossless foveation in perceptual tests.
Abstract
Foveated rendering uses real-time eye tracking to improve hardware efficiency and visual quality in virtual reality (VR). The eye tracker determines where the user is looking, so the system can render high-resolution graphics only in the foveal region, which corresponds to the small area of the retina where visual acuity is highest, while rendering the peripheral view at lower resolution. However, modern deep learning-based gaze-tracking solutions often exhibit a long-tail distribution of tracking errors, which can degrade the user experience and reduce the benefits of foveated rendering by misaligning the high-resolution region with the actual gaze and lowering visual quality. This paper introduces \textit{FovealNet}, an AI-driven gaze-tracking framework designed to optimize system performance by strategically enhancing gaze-tracking accuracy. To reduce the implementation cost of the gaze-tracking algorithm, FovealNet employs an event-based cropping method that eliminates over $64.8\%$ of irrelevant pixels from the input image. It also incorporates a simple yet effective token-pruning strategy that removes tokens on the fly without compromising tracking accuracy. Finally, to support different runtime rendering configurations, we propose a system performance-aware multi-resolution training strategy that allows the gaze-tracking DNN to adapt and optimize overall system performance more effectively. Evaluation results demonstrate that FovealNet achieves at least a $1.42\times$ speedup compared to previous methods and a 13\% increase in perceptual quality for the foveated output.
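To make the event-based cropping idea concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that "events" can be emulated by thresholding the intensity difference between two consecutive grayscale frames, and that the centroid of those events is a reasonable stand-in for the pupil location. The names `event_mask`, `crop_box`, `crop`, and `fraction_removed`, the threshold of 15, and the fixed window size are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Frame = List[List[int]]          # grayscale image: rows of intensities (0-255)
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def event_mask(prev: Frame, curr: Frame, threshold: int = 15) -> List[List[bool]]:
    """Mark pixels whose intensity changed by more than `threshold`.

    Assumption: frame differencing emulates event-camera output.
    """
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def crop_box(mask: List[List[bool]], crop_h: int, crop_w: int) -> Box:
    """Fixed-size window centered on the event centroid, clamped to the image.

    Falls back to the image center when no events fired.
    Requires crop_h <= image height and crop_w <= image width.
    """
    height, width = len(mask), len(mask[0])
    coords = [(r, c) for r, row in enumerate(mask) for c, hit in enumerate(row) if hit]
    if coords:
        center_r = sum(r for r, _ in coords) // len(coords)
        center_c = sum(c for _, c in coords) // len(coords)
    else:
        center_r, center_c = height // 2, width // 2
    top = min(max(center_r - crop_h // 2, 0), height - crop_h)
    left = min(max(center_c - crop_w // 2, 0), width - crop_w)
    return top, left, top + crop_h, left + crop_w


def crop(frame: Frame, box: Box) -> Frame:
    """Return the sub-image covered by `box`."""
    top, left, bottom, right = box
    return [row[left:right] for row in frame[top:bottom]]


def fraction_removed(frame: Frame, box: Box) -> float:
    """Fraction of the input pixels discarded by cropping to `box`."""
    top, left, bottom, right = box
    kept = (bottom - top) * (right - left)
    return 1.0 - kept / (len(frame) * len(frame[0]))


if __name__ == "__main__":
    # Synthetic 10x10 frames: a 3x3 bright patch appears in the second frame.
    prev = [[0] * 10 for _ in range(10)]
    curr = [row[:] for row in prev]
    for r in range(6, 9):
        for c in range(5, 8):
            curr[r][c] = 200

    box = crop_box(event_mask(prev, curr), crop_h=6, crop_w=6)
    eye_region = crop(curr, box)
    print(box, len(eye_region), len(eye_region[0]), fraction_removed(curr, box))
```

In this sketch only the cropped window would be passed to the gaze-tracking network, so the downstream cost scales with the window size rather than the full sensor resolution. The share of pixels removed depends entirely on the chosen window and input size; the $64.8\%$ figure reported above comes from the paper's own method and data, not from this example.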
