Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span

Heeseung Yun, Joonil Na, Jaeyeon Kim, Calvin Murdock, Gunhee Kim

TL;DR

This work tackles forecasting where in their 3D environment a person will direct their visual attention next, a key precursor to action in daily life and AR/VR contexts. It introduces EgoSpanLift, which lifts 2D egocentric gaze history into multi-level 3D volumetric regions derived from SLAM keypoints, and pairs it with a forecasting network that combines a 3D U-Net encoder and a unidirectional transformer to predict future visual spans as a 4-channel occupancy grid over a horizon $T_f$. The authors also curate two benchmarks, FoVS-Aria and FoVS-EgoExo, totaling 364.6K samples, and show that their end-to-end approach outperforms diverse baselines in 3D visual-span forecasting while achieving comparable results on 2D gaze anticipation when its predictions are projected back onto image planes. The framework could support real-time, perception-aware assistance for AR/VR and assistive technologies by modeling ahead of time where users will look in 3D space, without additional 2D-specific training.

Abstract

People continuously perceive and interact with their surroundings based on underlying intentions that drive their exploration and behaviors. While research in egocentric user and scene understanding has focused primarily on motion and contact-based interaction, forecasting human visual perception itself remains less explored despite its fundamental role in guiding human actions and its implications for AR/VR and assistive technologies. We address the challenge of egocentric 3D visual span forecasting, predicting where a person's visual perception will focus next within their three-dimensional environment. To this end, we propose EgoSpanLift, a novel method that transforms egocentric visual span forecasting from 2D image planes to 3D scenes. EgoSpanLift converts SLAM-derived keypoints into gaze-compatible geometry and extracts volumetric visual span regions. We further combine EgoSpanLift with 3D U-Net and unidirectional transformers, enabling spatio-temporal fusion to efficiently predict future visual span in the 3D grid. In addition, we curate a comprehensive benchmark from raw egocentric multisensory data, creating a testbed with 364.6K samples for 3D visual span forecasting. Our approach outperforms competitive baselines for egocentric 2D gaze anticipation and 3D localization while achieving comparable results even when projected back onto 2D image planes without additional 2D-specific training.
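The lifting step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a cubic voxel grid centered on the wearer, models each visual-span level as a cone around the gaze direction, and uses four hypothetical half-angles (`half_angles_deg`) only to match the 4-channel output mentioned in the TL;DR. The function name `lift_gaze_to_grid` and all parameter values are illustrative.

```python
import math


def lift_gaze_to_grid(keypoints, eye, gaze_dir,
                      half_angles_deg=(2.0, 8.0, 30.0, 60.0),  # assumed levels
                      grid_size=8, extent=2.0):
    """Mark voxels that contain keypoints inside nested gaze cones.

    Returns one set of (i, j, k) voxel indices per cone level. The grid is
    a cube of side 2 * extent centered on `eye` (an assumption of this
    sketch, not a detail taken from the paper).
    """
    norm = math.sqrt(sum(c * c for c in gaze_dir))
    gaze = tuple(c / norm for c in gaze_dir)
    voxel = 2.0 * extent / grid_size
    channels = [set() for _ in half_angles_deg]

    for point in keypoints:
        offset = tuple(p - e for p, e in zip(point, eye))
        index = tuple(int(math.floor((o + extent) / voxel)) for o in offset)
        # Skip keypoints that fall outside the local grid.
        if any(i < 0 or i >= grid_size for i in index):
            continue
        dist = math.sqrt(sum(o * o for o in offset))
        if dist == 0.0:
            continue
        # Angle between the gaze ray and the direction to the keypoint.
        cos_angle = sum(o * g for o, g in zip(offset, gaze)) / dist
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
        # A keypoint inside a narrow cone is also inside every wider one.
        for level, half_angle in enumerate(half_angles_deg):
            if angle <= half_angle:
                channels[level].add(index)
    return channels


points = [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0), (1.0, 0.0, -1.0), (0.0, 0.0, 10.0)]
channels = lift_gaze_to_grid(points, eye=(0.0, 0.0, 0.0), gaze_dir=(0.0, 0.0, 1.0))
print([len(c) for c in channels])  # → [1, 1, 2, 2]
```

In this toy input, the keypoint on the gaze axis lands in every channel, the one about 26.6 degrees off-axis lands only in the two widest cones, and the keypoints behind the wearer or outside the grid are dropped. Sparse sets stand in for a dense occupancy tensor to keep the example dependency-free.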

Paper Structure

This paper contains 13 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Can we forecast our gaze beyond the frame? We aim to predict a person's future visual focus in the surrounding 3D environment by lifting egocentric 2D gaze history to 3D regions and forecasting future 3D visual spans from previous observations.
  • Figure 2: Overview of EgoSpanLift. Using 3D semidense keypoints from egocentric observations, e.g., SLAM, we filter observed keypoints at a given time window and leverage multi-level human visual span to compute volumetric regions in 3D scenes.
  • Figure 3: (a) Illustration of our framework and (b-d) diverse scenario examples in our curated dataset.
  • Figure 4: Analysis of our proposed framework on the FoVS-EgoExo test split.
  • Figure 5: Qualitative examples. Our framework effectively forecasts various scenarios, such as (a) closing the fridge and turning around or (b) deciding which rocks to grab and navigate.