Human Gaze Boosts Object-Centered Representation Learning

Timothy Schaumlöffel; Arthur Aubret; Gemma Roig; Jochen Triesch

TL;DR

This work demonstrates that simulating human central vision by cropping around predicted gaze locations in egocentric video, combined with a time-based self-supervised learning objective, yields more object-centered visual representations than training on the full field of view. Using Ego4D data and a MoCoV3-like framework with slow-changing representations, the authors show improvements across hard-category, fine-grained, and instance recognition tasks, while also revealing reduced background reliance. The study also finds that temporal slowness and gaze dynamics contribute meaningfully to learning, and that gaze-based crops outperform gaze-agnostic center crops. Overall, the results provide a bio-inspired pathway to enhance object-centered visual representations from human-like egocentric experiences, narrowing the gap between machine and human vision on relevant tasks.

Abstract

Recent self-supervised learning (SSL) models trained on human-like egocentric visual inputs substantially underperform humans on image recognition tasks. These models train on raw, uniform visual inputs collected from head-mounted cameras. Human vision works differently: the anatomical structure of the retina and visual cortex relatively amplifies central visual information, i.e., the information around the gaze location. This selective amplification likely aids in forming object-centered visual representations. Here, we investigate whether focusing on central visual information boosts egocentric visual object learning. We simulate 5 months of egocentric visual experience using the large-scale Ego4D dataset and generate gaze locations with a human gaze prediction model. To account for the importance of central vision in humans, we crop the visual area around the gaze location. Finally, we train a time-based SSL model on these modified inputs. Our experiments demonstrate that focusing on central vision leads to better object-centered representations. Our analysis shows that the SSL model leverages the temporal dynamics of gaze movements to build stronger visual representations. Overall, our work marks a significant step toward bio-inspired learning of visual representations.
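The gaze-centered cropping step described in the abstract can be illustrated with a minimal sketch. The function name `gaze_crop_box`, the 640 x 480 frame size in the usage note, and the border handling (shifting the window back inside the frame instead of padding) are assumptions made for illustration; the abstract does not specify how crops near the frame border are treated.

```python
def gaze_crop_box(gaze_x, gaze_y, frame_w, frame_h, n=224):
    """Return (left, top, right, bottom) of an n x n crop centered on the gaze.

    Assumption (not from the paper): when the gaze lies close to a border,
    the window is shifted back inside the frame rather than padded.
    """
    if n > frame_w or n > frame_h:
        raise ValueError("crop size exceeds frame size")
    # Center the window on the gaze point, then clamp it to stay in-frame.
    left = min(max(int(round(gaze_x - n / 2)), 0), frame_w - n)
    top = min(max(int(round(gaze_y - n / 2)), 0), frame_h - n)
    return left, top, left + n, top + n
```

For a hypothetical 640 x 480 frame, `gaze_crop_box(320, 240, 640, 480)` yields a box centered on the frame, while a gaze point near a corner yields a box flush with that corner. The returned box can then be passed to any image-cropping routine.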

Paper Structure (28 sections, 3 equations, 5 figures, 5 tables)

Introduction
Related works
Egocentric SSL.
Time-based SSL.
Method
Dataset
Human-like egocentric visual experience.
Gaze location.
Bio-inspired focus on central vision.
Learning model
Evaluation
Hard object categorization.
Easy object categorization.
Fine-grained object categorization.
Instance object recognition.
...and 13 more sections

Figures (5)

Figure 1: Illustration of our data generation and model training approach. (1) We extract frames from the egocentric dataset Ego4D (Grauman et al., 2022). (2) For each frame, we predict the human gaze location (red dot) using a state-of-the-art model (Lai et al., BMVC 2022). (3) We train a time-augmented SSL model to align representations of gaze-centered crops (red rectangle) extracted from close-in-time frames.
Figure 2: Impact of the gaze-based crop size on different semantic image recognition groups. We compute the average improvement for each semantic group of datasets with respect to $N^2=112 \times 112$. We use a temporal window of $\Delta T=15$ seconds.
Figure 3: ImageNet-9 recognition sensitivity to a missing background or a missing foreground object. We show the relative improvement with respect to the worst model for the two settings. Higher values indicate a representation that is relatively more robust to missing backgrounds or missing objects. We use a temporal window of $\Delta T=15$ seconds.
Figure 4: Impact of the temporal window of slowness learning on different semantic image recognition groups. We compute the average improvement for each semantic group of datasets with respect to $\Delta T=0$ seconds. We use a crop of size $N=224$.
Figure 5: Distribution of the gaze center location over the Ego4D dataset. The red dot symbolizes the center of the frame.
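
The temporal window $\Delta T$ that appears in Figures 2 to 4 can be illustrated with a small sketch of how a pair of close-in-time frames might be drawn. The function name `sample_temporal_pair`, the `fps` parameter, and the uniform sampling within the window are assumptions made for illustration; the captions only state that aligned crops come from close-in-time frames.

```python
import random


def sample_temporal_pair(num_frames, fps, delta_t, rng=random):
    """Pick an anchor frame index and a positive index within delta_t seconds.

    Assumption (not from the paper): the positive is drawn uniformly from
    all frames whose distance to the anchor is at most delta_t seconds.
    """
    if num_frames < 1:
        raise ValueError("video must contain at least one frame")
    max_offset = int(delta_t * fps)  # window half-width in frames
    anchor = rng.randrange(num_frames)
    lo = max(0, anchor - max_offset)
    hi = min(num_frames - 1, anchor + max_offset)
    positive = rng.randint(lo, hi)  # inclusive on both ends
    return anchor, positive
```

With `delta_t=0` the positive frame coincides with the anchor, so the two views differ only by whatever image augmentations are applied afterwards; larger windows let the pair span different moments of the gaze trajectory.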
