Eyes on Target: Gaze-Aware Object Detection in Egocentric Video

Vishakha Lall, Yisi Liu

TL;DR

Eyes on Target presents a depth-aware, gaze-guided object detector for egocentric video by injecting gaze-derived features into a Vision Transformer, biasing attention toward human-attended regions. It extends DETR with a gaze-modulated attention mechanism, depth- and pupil-driven RoI scaling, and a novel gaze-aware head importance metric for interpretability. Across the Egocentric Maritime Simulator, Ego Motion, Ego-CH-Gaze, and CUB-GHA datasets, the approach achieves substantial gains over gaze-agnostic baselines and competitive performance against state-of-the-art gaze methods, while offering insights into how gaze cues reshape transformer attention. This work demonstrates the practical value of aligning machine attention with human attention in safety-critical, first-person scenarios and provides a quantitative tool for explainable, gaze-guided perception.

Abstract

Human gaze offers rich supervisory signals for understanding visual attention in complex visual environments. In this paper, we propose Eyes on Target, a novel depth-aware and gaze-guided object detection framework designed for egocentric videos. Our approach injects gaze-derived features into the attention mechanism of a Vision Transformer (ViT), effectively biasing spatial feature selection toward human-attended regions. Unlike traditional object detectors that treat all regions equally, our method emphasises viewer-prioritised areas to enhance object detection. We validate our method on an egocentric simulator dataset where human visual attention is critical for task assessment, illustrating its potential in evaluating human performance in simulation scenarios. We evaluate the effectiveness of our gaze-integrated model through extensive experiments and ablation studies, demonstrating consistent gains in detection accuracy over gaze-agnostic baselines on both the custom simulator dataset and public benchmarks, including Ego4D Ego-Motion and Ego-CH-Gaze datasets. To interpret model behaviour, we also introduce a gaze-aware attention head importance metric, revealing how gaze cues modulate transformer attention dynamics.
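
The idea of biasing attention toward human-attended regions can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the Gaussian-shaped additive bias, the names `gaze_bias` and `gaze_modulated_attention`, and the parameters `sigma` and `strength` are hypothetical and are not taken from the paper.

```python
import numpy as np

def gaze_bias(grid_h, grid_w, gaze_xy, sigma=0.15):
    """Log-Gaussian bias over patch centres, peaked at the normalised gaze point.

    Assumption: gaze_xy is (x, y) in [0, 1] image coordinates.
    """
    ys = (np.arange(grid_h) + 0.5) / grid_h
    xs = (np.arange(grid_w) + 0.5) / grid_w
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    d2 = (xx - gaze_xy[0]) ** 2 + (yy - gaze_xy[1]) ** 2
    return (-d2 / (2.0 * sigma ** 2)).reshape(-1)  # shape (grid_h * grid_w,)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gaze_modulated_attention(q, k, v, bias, strength=1.0):
    """Scaled dot-product attention with an additive per-key gaze bias (illustrative)."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(logits + strength * bias[None, :], axis=-1)
    return weights @ v, weights

# Toy example: a 4x4 grid of patch tokens, gaze near the top-right corner.
rng = np.random.default_rng(0)
grid_h, grid_w, dim = 4, 4, 8
tokens = rng.normal(size=(grid_h * grid_w, dim))
bias = gaze_bias(grid_h, grid_w, gaze_xy=(0.9, 0.1))

_, plain = gaze_modulated_attention(tokens, tokens, tokens, bias, strength=0.0)
_, gazed = gaze_modulated_attention(tokens, tokens, tokens, bias, strength=1.0)
```

With `strength=0.0` the function reduces to ordinary attention, so comparing `plain` and `gazed` shows how much weight shifts toward the patch nearest the gaze point. An additive bias on the logits keeps each row a valid probability distribution, which is one reason it is a common way to inject spatial priors into attention.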

Paper Structure

This paper contains 20 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Visualisation of attention modification: (a) Gaze position (circle) and direction (arrows from left and right eye) (b) Zoomed in view of the gaze point (c) Attention map from the original model (d) Attention map after gaze modifications to the original model, showing increased attention in regions around the gaze point.
  • Figure 2: Examples from the Egocentric Maritime Simulator Dataset test set, showing (from left to right) gaze points on original frames with target depth $d$, modified attention heatmaps with pupil dilation $p$, and bounding box localisations and classification results with IoU against ground truth. Frames with closer targets (a:c) display stronger heatmaps and larger boxes due to focus intensity and depth scaling, with high IoU against ground truth (0.94); frames with distant targets (d:f) show smaller bounding boxes and still achieve strong IoU (0.92).
  • Figure 3: Model predictions for the Ego Motion Dataset during video streaming (a:d) and reading (e:h) tasks, with similar depth $d$ profiles. Bounding box sizes, scaled by pupil dilation $p$, are larger for the reading task, indicating higher visual focus compared to video streaming.
  • Figure 4: Model predictions for the Ego-CH-Gaze dataset (without depth and pupil dilation information).
  • Figure 5: Qualitative visualisation of attention maps (>60%) from the $1^{st}$ (red), $5^{th}$ (blue) and $6^{th}$ (green) heads of Layer 2 of DETR.