GazeProphetV2: Head-Movement-Based Gaze Prediction Enabling Efficient Foveated Rendering on Mobile VR

Farhaan Ebadulla, Chiraag Mudlpaur, Shreya Chaurasia, Gaurav BV

TL;DR

This work tackles the challenge of predicting user gaze in VR to enable efficient foveated rendering without relying on expensive eye-tracking hardware. A multimodal architecture combines temporal gaze history, head orientation, and scene content through a CNN-based scene encoder, dual LSTMs, and a three-way gated fusion, with multi-step autoregressive prediction and auxiliary losses for robust learning. The approach reports strong cross-scene generalization (about 93.1% validation accuracy for predictions 1–3 frames ahead across 22 scenes) and real-time performance (~88 FPS, ~4.21 ms latency), and a user study indicates preserved perceptual quality alongside rendering savings. These results demonstrate practical potential for attention-aware VR rendering on mobile hardware and provide a foundation for expanding multimodal predictive cues in immersive environments.

Abstract

Predicting gaze behavior in virtual reality environments remains a significant challenge with implications for rendering optimization and interface design. This paper introduces a multimodal approach to VR gaze prediction that combines temporal gaze patterns, head movement data, and visual scene information. By leveraging a gated fusion mechanism with cross-modal attention, the approach learns to adaptively weight gaze history, head movement, and scene content based on contextual relevance. Evaluations using a dataset spanning 22 VR scenes with 5.3M gaze samples demonstrate improvements in predictive accuracy when combining modalities compared to using individual data streams alone. The results indicate that integrating past gaze trajectories with head orientation and scene content enhances prediction accuracy across 1–3 future frames. Cross-scene generalization testing shows consistent performance, with 93.1% validation accuracy and temporal consistency in predicted gaze trajectories. These findings contribute to understanding attention mechanisms in virtual environments while suggesting potential applications in rendering optimization, interaction design, and user experience evaluation. The approach represents a step toward more efficient virtual reality systems that can anticipate user attention patterns without requiring expensive eye-tracking hardware.
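
To make the idea of adaptively weighting three modalities concrete, the following is a minimal sketch of one common form of gated fusion, not the authors' implementation. It assumes each encoder has already produced a feature vector of the same length, and that the gates are a softmax over three logits computed from the concatenated features. The function names, the gate parameters, and the choice of softmax gating are all assumptions made for illustration; the paper's actual gating and cross-modal attention may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(gaze, head, scene, gate_weights, gate_bias):
    """Fuse three equal-length feature vectors with softmax gates.

    gate_weights: 3 rows of length 3*d (hypothetical learned parameters).
    gate_bias:    3 values (hypothetical learned parameters).
    Returns (fused_vector, gates), where gates sum to 1.
    """
    d = len(gaze)
    if len(head) != d or len(scene) != d:
        raise ValueError("all modality vectors must have the same length")

    # The gate logits depend on all three modalities at once.
    concat = gaze + head + scene
    logits = [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(gate_weights, gate_bias)
    ]
    gates = softmax(logits)

    # Weighted sum of the modality features, one scalar gate per modality.
    fused = [
        gates[0] * g + gates[1] * h + gates[2] * s
        for g, h, s in zip(gaze, head, scene)
    ]
    return fused, gates


if __name__ == "__main__":
    gaze, head, scene = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
    zero_weights = [[0.0] * 6 for _ in range(3)]
    fused, gates = gated_fusion(gaze, head, scene, zero_weights, [0.0, 0.0, 0.0])
    print(gates, fused)
```

With all gate parameters set to zero, the three gates are equal and the fused vector is the plain average of the inputs; training would move the gates away from that uniform starting point depending on context.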

Paper Structure

This paper contains 24 sections, 15 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Complete system architecture showing CNN scene encoder, LSTM temporal encoders for gaze and head sequences, three-way gated fusion mechanism, and multi-step prediction heads. The architecture processes 15-frame input sequences to generate 1, 2, and 3-frame ahead gaze predictions with auxiliary losses ensuring individual encoder contributions.
  • Figure 2: Training and validation loss curves demonstrating stable convergence and early stopping effectiveness.
  • Figure 3: CPU and GPU utilization during real-time gaze prediction inference, demonstrating computational efficiency suitable for VR applications.
  • Figure 4: Frame rate performance over time showing consistent 88.4 FPS average with excellent stability for VR applications.
  • Figure 5: Visual comparison of normal rendering (left) versus foveated rendering (right) for ball bouncing (top) and road crossing (bottom) scenarios, showing attention-aware quality allocation.
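
The Figure 1 caption describes 15-frame input sequences paired with predictions 1, 2, and 3 frames ahead. The sketch below shows one way such training pairs could be sliced from a recorded sample stream. It is an illustrative assumption, not the paper's data pipeline: the function name `make_windows`, its parameters, and the use of a stride-1 sliding window are all hypothetical.

```python
def make_windows(samples, history=15, horizons=(1, 2, 3)):
    """Slice a sample stream into (inputs, targets) pairs.

    inputs:  `history` consecutive samples.
    targets: the samples that lie h frames after the last input frame,
             for each h in `horizons`.
    """
    max_h = max(horizons)
    windows = []
    # Stop early enough that the furthest horizon is still in range.
    for start in range(len(samples) - history - max_h + 1):
        end = start + history  # index of the first future frame
        inputs = samples[start:end]
        targets = [samples[end + h - 1] for h in horizons]
        windows.append((inputs, targets))
    return windows


if __name__ == "__main__":
    # Integers stand in for per-frame gaze/head records.
    stream = list(range(20))
    for inputs, targets in make_windows(stream):
        print(inputs[0], inputs[-1], targets)
```

Each sample here is a placeholder; in practice a sample would bundle the gaze, head, and scene data for one frame.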