Human Gaze Boosts Object-Centered Representation Learning
Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch
TL;DR
This work demonstrates that simulating human central vision by cropping around predicted gaze locations in egocentric video, combined with a time-based self-supervised learning objective, yields more object-centered visual representations than training on the full field of view. Using Ego4D data and a MoCo v3-like framework that learns slowly changing representations, the authors show improvements across hard-category, fine-grained, and instance recognition tasks, together with reduced reliance on background information. The study further finds that temporal slowness and gaze dynamics both contribute to learning, and that gaze-based crops outperform gaze-agnostic center crops. Overall, the results provide a bio-inspired pathway to enhance object-centered visual representations from human-like egocentric experience, narrowing the gap between machine and human vision on these recognition tasks.
Abstract
Recent self-supervised learning (SSL) models trained on human-like egocentric visual inputs substantially underperform humans on image recognition tasks. These models train on raw, uniform visual inputs collected from head-mounted cameras. Human vision differs: the anatomical structure of the retina and visual cortex amplifies central visual information, i.e., the information around the gaze location, relative to the periphery. This selective amplification likely aids in forming object-centered visual representations. Here, we investigate whether focusing on central visual information boosts egocentric visual object learning. We simulate five months of egocentric visual experience using the large-scale Ego4D dataset and generate gaze locations with a human gaze prediction model. To account for the importance of central vision in humans, we crop the visual area around the gaze location. Finally, we train a time-based SSL model on these modified inputs. Our experiments demonstrate that focusing on central vision leads to better object-centered representations. Our analysis shows that the SSL model leverages the temporal dynamics of gaze movements to build stronger visual representations. Overall, our work marks a significant step toward bio-inspired learning of visual representations.
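
The gaze-centered cropping and the pairing of temporally close views described above can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `gaze_crop` and `temporal_positive_pair`, the square crop, the border handling (shifting the window inward), and the parameters `crop_size` and `max_offset` are all assumptions made for illustration. Gaze coordinates are assumed to be given in pixels, as they might be after running a gaze prediction model.

```python
import numpy as np


def gaze_crop(frame, gaze_xy, crop_size):
    """Return a crop_size x crop_size window of `frame` centered on the gaze.

    frame: array of shape (H, W, C); gaze_xy: (x, y) in pixel coordinates.
    Assumption of this sketch: when the gaze is near a border, the window
    is shifted inward so the output always has the requested size.
    """
    h, w = frame.shape[:2]
    if crop_size > min(h, w):
        raise ValueError("crop_size must fit inside the frame")
    half = crop_size // 2
    x, y = int(round(gaze_xy[0])), int(round(gaze_xy[1]))
    # Clamp the top-left corner so the window stays inside the frame.
    left = min(max(x - half, 0), w - crop_size)
    top = min(max(y - half, 0), h - crop_size)
    return frame[top:top + crop_size, left:left + crop_size]


def temporal_positive_pair(frames, gazes, t, max_offset, crop_size, rng):
    """Build a positive pair for a time-based SSL objective (hypothetical).

    Frame t and a frame at most `max_offset` steps away are each cropped
    around their own gaze location, so gaze movement between the two
    frames changes what the two views contain.
    """
    lo = max(t - max_offset, 0)
    hi = min(t + max_offset, len(frames) - 1)
    t2 = int(rng.integers(lo, hi + 1))  # upper bound is exclusive
    view_a = gaze_crop(frames[t], gazes[t], crop_size)
    view_b = gaze_crop(frames[t2], gazes[t2], crop_size)
    return view_a, view_b


# Usage with synthetic data standing in for video frames and predicted gaze.
rng = np.random.default_rng(0)
frames = rng.random((8, 64, 96, 3))                 # 8 frames, 64x96 RGB
gazes = [(48.0 + 2 * i, 32.0) for i in range(8)]    # gaze drifting right
view_a, view_b = temporal_positive_pair(
    frames, gazes, t=3, max_offset=2, crop_size=32, rng=rng
)
```

In a contrastive setup, the two returned views would be encoded and pulled together in representation space, which encourages representations that change slowly over time. A gaze-agnostic baseline can be obtained in this sketch by always passing the frame center as `gaze_xy`.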
