FovealNet: Advancing AI-Driven Gaze Tracking Solutions for Optimized Foveated Rendering System Performance in Virtual Reality

Wenxuan Liu, Monde Duinkharjav, Qi Sun, Sai Qian Zhang

TL;DR

FovealNet addresses the latency-accuracy trade-off in gaze-tracked foveated rendering by combining event-based eye-region cropping, token-wise pruning within a ViT-based gaze tracker, and a performance-aware, multi-resolution training framework. The method ties gaze-tracking error directly to rendering latency, enabling runtime adaptation that reduces per-frame latency while preserving perceptual quality: the authors report a ≥1.42× speedup over previous methods and a 13% gain in perceptual quality for foveated output. Key innovations include an efficient pupil-centered region-cropping algorithm, token packaging for selective attention, and a loss design that minimizes worst-case gaze errors, all validated on OpenEDS2020 and with perceptual metrics such as FovVideoVDP. The work targets VR/AR systems, supporting faster and more reliable gaze-contingent rendering across devices and rendering resolutions, with strong tail-error reductions and near-lossless foveation in perceptual tests.

Abstract

Leveraging real-time eye tracking, foveated rendering optimizes hardware efficiency and enhances visual quality in virtual reality (VR). This approach uses eye tracking to determine where the user is looking, allowing the system to render high-resolution graphics only in the foveal region (the small area of the retina where visual acuity is highest), while the peripheral view is rendered at lower resolution. However, modern deep learning-based gaze-tracking solutions often exhibit a long-tail distribution of tracking errors, which can degrade user experience and reduce the benefits of foveated rendering by causing misalignment and decreased visual quality. This paper introduces FovealNet, an advanced AI-driven gaze-tracking framework designed to optimize system performance by strategically enhancing gaze-tracking accuracy. To further reduce the implementation cost of the gaze-tracking algorithm, FovealNet employs an event-based cropping method that eliminates over $64.8\%$ of irrelevant pixels from the input image. Additionally, it incorporates a simple yet effective token-pruning strategy that dynamically removes tokens on the fly without compromising tracking accuracy. Finally, to support different runtime rendering configurations, we propose a system performance-aware multi-resolution training strategy, allowing the gaze-tracking DNN to adapt and optimize overall system performance more effectively. Evaluation results demonstrate that FovealNet achieves at least a $1.42\times$ speedup compared to previous methods and a 13% increase in perceptual quality for foveated output.
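To make the idea of event-based cropping concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that an "event" is a pixel whose intensity changes by more than a threshold between two consecutive eye images, and that the crop is the bounding box of those events plus a small margin. The names `event_mask`, `crop_to_events`, and `removed_fraction`, and the parameters `threshold` and `margin`, are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Frame = List[List[int]]          # grayscale image: rows of 0-255 intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def event_mask(prev: Frame, curr: Frame, threshold: int) -> List[List[bool]]:
    """Mark pixels whose intensity changed by more than `threshold`.

    Assumption: frame differencing stands in for an event signal.
    """
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def crop_to_events(prev: Frame, curr: Frame,
                   threshold: int = 15, margin: int = 1) -> Tuple[Frame, Box]:
    """Crop `curr` to the bounding box of changed pixels, padded by `margin`.

    If no pixel changed, the full frame is returned unchanged.
    """
    height, width = len(curr), len(curr[0])
    mask = event_mask(prev, curr, threshold)
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    if not rows:
        return curr, (0, 0, height, width)
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + 1 + margin, height)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + 1 + margin, width)
    cropped = [row[left:right] for row in curr[top:bottom]]
    return cropped, (top, left, bottom, right)


def removed_fraction(frame: Frame, box: Box) -> float:
    """Fraction of the frame's pixels that fall outside `box`."""
    top, left, bottom, right = box
    total = len(frame) * len(frame[0])
    kept = (bottom - top) * (right - left)
    return 1.0 - kept / total
```

On a toy 8×8 frame where only two neighboring pixels change, this sketch keeps a 4×4 window and discards 75% of the pixels. That number describes the toy input only and is unrelated to the $64.8\%$ figure reported in the abstract.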

Paper Structure

This paper contains 27 sections, 5 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: (a) TFR system configuration. (b) Foveated rendering in VR device.
  • Figure 2: Given an input eye image, FovealNet first crops the image to remove background patches (step 1). The remaining patches undergo fine-grained token-wise pruning to eliminate unimportant tokens (step 2), and the remaining tokens are processed by the ViT.
  • Figure 3: (a) TFR system configuration. (b) Latency breakdown (normalized) of TFR.
  • Figure 4: (a) Visual quality degradation due to tracking error; the foveal region is then enlarged for better visual quality. (b) Observer's ability to discriminate foveated images with and without tracking error, measured by JND. The $x$-axis indicates the eccentricity angle subtended by the fovea. The left $y$-axis is the JND score; the right $y$-axis is the probability of discriminating the image from the ground truth.
  • Figure 5: (a) Predicted gaze error distributions on the OpenEDS2020 dataset, showing mean, 5th, 95th percentiles, min, and max angular errors. NVgaze results were excluded due to high tracking error and inconsistent performance. (b) Rendering latency for existing methods in both average and max error scenarios with a resolution of $1080\times1920$. (c) Rendering latency increases with eccentricity on HMD and GPU at resolutions $720\times1280$, $1080\times1920$, and $1440\times2560$, with the shaded area showing the $5\%$-$95\%$ confidence interval.
  • ...and 6 more figures