EgoCampus: Egocentric Pedestrian Eye Gaze Model and Dataset

Ronan John, Aditya Kesari, Vincenzo DiMatteo, Kristin Dana

TL;DR

This work introduces EgoCampus, a large outdoor egocentric gaze dataset collected with Project Aria glasses, and EgoCampusNet (ECN), a spatio-temporal fusion model that predicts pedestrian gaze heatmaps from egocentric video. Using synchronized RGB video, eye gaze, IMU, GPS, and other sensor streams from 82 participants across 25 campus paths, the authors show that combining temporal video features with a query-frame encoder yields strong gaze-prediction performance. ECN achieves state-of-the-art results on standard gaze-saliency metrics (AUC-Judd, CC, KLD, SIM, NSS) and outperforms pretrained baselines that were not trained on EgoCampus, which underlines the value of environment-aware, navigation-driven gaze modeling. The dataset and model support improved navigation and human-robot interaction in real-world settings, and additional resources such as the YOPO-Campus robot-view dataset support multimodal navigation research.
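
For readers unfamiliar with the metrics named above, the following is a minimal, dependency-free sketch of four of them (CC, SIM, KLD, NSS) using their common saliency-benchmark definitions. The function names, the `eps` smoothing constant, and the nested-list heatmap representation are assumptions made for illustration; the paper's own evaluation code may differ, for example in normalization or in the direction of the KL divergence. AUC-Judd is omitted for brevity.

```python
import math

def _flatten(grid):
    """Turn a 2D heatmap (list of rows) into a flat list of values."""
    return [v for row in grid for v in row]

def _to_distribution(values, eps=1e-12):
    """Normalize non-negative values so that they sum to (almost exactly) 1."""
    total = sum(values)
    return [v / (total + eps) for v in values]

def cc(pred, target):
    """Pearson linear correlation coefficient between two heatmaps."""
    p, t = _flatten(pred), _flatten(target)
    mp, mt = sum(p) / len(p), sum(t) / len(t)
    cov = sum((a - mp) * (b - mt) for a, b in zip(p, t))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    st = math.sqrt(sum((b - mt) ** 2 for b in t))
    # A constant map has no variance, so correlation is reported as 0 here.
    return cov / (sp * st) if sp > 0 and st > 0 else 0.0

def sim(pred, target):
    """Histogram intersection of the two maps treated as distributions."""
    p = _to_distribution(_flatten(pred))
    t = _to_distribution(_flatten(target))
    return sum(min(a, b) for a, b in zip(p, t))

def kld(pred, target, eps=1e-12):
    """KL divergence sum(t * log(t / p)), with target t and prediction p."""
    p = _to_distribution(_flatten(pred))
    t = _to_distribution(_flatten(target))
    return sum(b * math.log(eps + b / (a + eps)) for a, b in zip(p, t))

def nss(pred, fixations):
    """Mean z-scored prediction value at fixated (row, col) pixels."""
    p = _flatten(pred)
    mean = sum(p) / len(p)
    std = math.sqrt(sum((v - mean) ** 2 for v in p) / len(p))
    if std == 0:
        return 0.0
    return sum((pred[r][c] - mean) / std for r, c in fixations) / len(fixations)

# Toy example: a 3x3 prediction peaked at the center, scored against itself.
heatmap = [[0.0, 0.1, 0.0],
           [0.1, 0.6, 0.1],
           [0.0, 0.1, 0.0]]
scores = {
    "CC": cc(heatmap, heatmap),
    "SIM": sim(heatmap, heatmap),
    "KLD": kld(heatmap, heatmap),
    "NSS": nss(heatmap, [(1, 1)]),
}
```

Higher is better for CC, SIM, and NSS, while lower is better for KLD. CC, SIM, and KLD compare a predicted map against a ground-truth map, whereas NSS scores the prediction only at the recorded gaze locations.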

Abstract

We address the challenge of predicting human visual attention during real-world navigation by measuring and modeling egocentric pedestrian eye gaze in an outdoor campus setting. We introduce the EgoCampus dataset, which spans 25 unique outdoor paths over 6 km across a university campus with recordings from more than 80 distinct human pedestrians, resulting in a diverse set of gaze-annotated videos. The system used for collection, Meta's Project Aria glasses, integrates eye tracking, front-facing RGB cameras, inertial sensors, and GPS to provide rich data from the human perspective. Unlike many prior egocentric datasets that focus on indoor tasks or exclude eye gaze information, our work emphasizes visual attention while subjects walk along outdoor campus paths. Using this data, we develop EgoCampusNet, a novel method to predict the eye gaze of pedestrians as they navigate outdoor environments. Our contributions provide both a new resource for studying real-world attention and a foundation for future work on gaze prediction models for navigation. Dataset and code are available upon request, and will be made publicly available at a later date at https://github.com/ComputerVisionRutgers/EgoCampus .

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of our proposed gaze prediction model, ECN. The Project Aria glasses capture a stream of egocentric video. From this video, we extract features with a pretrained backbone. From the query image (usually the last frame), we extract features with a trained image encoder. Lastly, the video and image features are fused and decoded in order to predict the final output, representing where people are most likely to look during egocentric motion.
  • Figure 2: Samples of video sequences from the EgoCampus dataset. EgoCampus contains 32 hours of video from the Project Aria glasses, capturing egocentric video from 82 pedestrians traversing 25 distinct paths on a university campus. Each frame has an associated eye gaze coordinate (shown above as the red "+") and auxiliary sensor readings (IMU, GPS, Wi-Fi).
  • Figure 3: Dataset Paths. A map showing a subset of the campus region, with colored paths indicating the participants' walking trajectories. During data collection, each participant follows a set of paths forwards and backwards. A sample of the captured egocentric video is shown.
  • Figure 4: An overview of the proposed spatio-temporal fusion method. A pretrained video feature extractor backbone is used to extract spatio-temporal features that encode information about the input frames. The spatio-temporal features are encoded and upscaled with a ResNet block. In parallel, the query frame is encoded with a separate ResNet block. The image features and spatio-temporal features are concatenated along the feature dimension before being decoded into the final output.
  • Figure 5: Qualitative comparison of predicted gaze heatmaps on sample frames. For each scene (row), we show the query frame with the gaze point, the output from our proposed model, and outputs from six comparison methods. Our model consistently produces a more focal heatmap that is more accurately centered on the true gaze point. Note that the white $+$ icon denotes the true gaze point in all images.
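
The fusion step described in the Figure 4 caption can be illustrated at the level of tensor shapes. The sketch below is not the authors' implementation: the pretrained backbone and both ResNet blocks are replaced by hand-written toy feature maps, and `upsample_nearest`, `concat_channels`, and `decode_to_heatmap` are hypothetical helper names. It only shows the data flow of upscaling coarse video features, concatenating them with query-frame features along the channel (feature) dimension, and decoding the result into a normalized heatmap.

```python
import math

# Feature maps are nested lists shaped [channels][height][width].

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of every channel by an integer factor."""
    return [
        [[row[x // factor] for x in range(len(row) * factor)]
         for row in channel for _ in range(factor)]
        for channel in feat
    ]

def concat_channels(a, b):
    """Concatenate two feature maps along the channel (feature) dimension."""
    if (len(a[0]), len(a[0][0])) != (len(b[0]), len(b[0][0])):
        raise ValueError("spatial sizes must match before fusion")
    return a + b

def decode_to_heatmap(fused, weights):
    """Stand-in decoder: 1x1 weighted channel sum, then a spatial softmax."""
    h, w = len(fused[0]), len(fused[0][0])
    logits = [[sum(wt * ch[y][x] for wt, ch in zip(weights, fused))
               for x in range(w)] for y in range(h)]
    peak = max(max(row) for row in logits)  # subtract max for stability
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

# Toy stand-in for backbone output: 1 channel at coarse 2x2 resolution.
video_feat = [[[0.0, 1.0],
               [2.0, 3.0]]]
# Toy stand-in for query-frame encoder output: 2 channels at 4x4 resolution.
query_feat = [[[0.0] * 4 for _ in range(4)],
              [[1.0] * 4 for _ in range(4)]]

fused = concat_channels(upsample_nearest(video_feat, 2), query_feat)
heatmap = decode_to_heatmap(fused, weights=[1.0, 0.5, 0.5])
```

Here `fused` has three channels (one from the video branch, two from the query branch) at 4x4 resolution, and `heatmap` is a 4x4 map that sums to 1. In a real model the decoder weights would be learned, and the concatenation would typically be a single tensor operation in a deep learning framework.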