Temporal Slowness in Central Vision Drives Semantic Object Learning

Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch

TL;DR

The study investigates how central vision and temporal slowness contribute to semantic object learning from natural egocentric vision. The authors simulate about five months of human-like visual experience with Ego4D, predict gaze locations, and train a time-contrastive self-supervised learning (SSL) model on gaze-centered crops. They show that central vision emphasizes foreground object features, while temporal slowness yields broader semantic information, especially during fixations. Alignment between the learned representations and object co-occurrence embeddings, measured with CKA, is stronger for the bio-inspired, slowness-enabled models, with ViT architectures showing notable gains. These findings suggest a biologically grounded mechanism for how humans might develop semantic object representations, and offer guidance for embodied AI systems that use gaze-informed perception and slow feature learning.
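As an illustration of the CKA comparison mentioned above, here is a minimal sketch of linear CKA between two feature matrices computed on the same set of items. The linear variant is an assumption: this summary does not state which kernel the authors use, and the function name `linear_cka` is hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices with one row per item.

    x has shape (n_items, d1) and y has shape (n_items, d2).
    Assumption: linear kernel; the paper's exact CKA variant is not given here.
    """
    # Center every feature dimension across items.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

In this setting, `x` would hold model embeddings of a set of object categories and `y` the co-occurrence embeddings of the same categories, in the same row order. The score is invariant to rotations and isotropic rescaling of either feature space, which is why it can compare spaces of different dimensionality.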

Abstract

Humans acquire semantic object representations from egocentric visual streams with minimal supervision. Importantly, the visual system processes with high resolution only the center of its field of view and learns similar representations for visual inputs occurring close in time. This emphasizes slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and generate gaze coordinates with a state-of-the-art gaze prediction model. Using these predictions, we extract crops that mimic central vision and train a time-contrastive Self-Supervised Learning model on them. Our results show that combining temporal slowness and central vision improves the encoding of different semantic facets of object representations. Specifically, focusing on central vision strengthens the extraction of foreground object features, while considering temporal slowness, especially during fixational eye movements, allows the model to encode broader semantic information about objects. These findings provide new insights into the mechanisms by which humans may develop semantic object representations from natural visual experience.
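The two ingredients described in the abstract, gaze-centered crops and a time-contrastive objective, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the clamping of crops at the frame border, the InfoNCE-style loss, the temperature value, and the names `gaze_crop` and `info_nce` are all hypothetical choices made for this example.

```python
import numpy as np


def gaze_crop(frame: np.ndarray, gaze_xy, size: int) -> np.ndarray:
    """Cut a size x size window centered on the gaze point.

    Assumption: windows that would leave the frame are shifted back inside it.
    """
    h, w = frame.shape[:2]
    if size > h or size > w:
        raise ValueError("crop size exceeds frame size")
    x, y = gaze_xy
    left = int(np.clip(round(x - size / 2), 0, w - size))
    top = int(np.clip(round(y - size / 2), 0, h - size))
    return frame[top:top + size, left:left + size]


def info_nce(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE-style loss: row i of z_a and row i of z_b form a positive pair.

    Here z_a and z_b would be embeddings of crops taken from frames that are
    close in time; all other rows in the batch act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))
```

For example, `gaze_crop(frame, (320, 240), 224)` returns a 224 x 224 window around the predicted gaze point. Minimizing `info_nce` pulls embeddings of temporally close crops together, which is one common way to implement a slowness objective.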

Paper Structure (32 sections, 2 equations, 11 figures, 10 tables)

Figures (11)

  • Figure 1: Illustration of our data generation and model training approach. (1) We extract frames from the egocentric dataset Ego4D (Grauman et al., 2022). (2) For each frame, we predict the human gaze location (red dot) using a state-of-the-art model (Lai et al., 2022). (3) We train a time-augmented SSL model to align representations of gaze-centered crops (red rectangle) extracted from close-in-time frames.
  • Figure 2: Impact of the gaze-based crop size on different semantic image recognition groups for ResNet50 and ViT-B/16. We compute the average improvement for each semantic group of datasets with respect to $N=112$. We use a temporal window of $\Delta T=3$s. Full results are provided in the Appendix (\ref{tab:completetab_gs}).
  • Figure 3: ImageNet-9 recognition sensitivity to a missing background or a missing foreground object. We show the relative improvement with respect to the worst model for the two settings. Higher values indicate a representation that is relatively more robust to missing backgrounds or missing objects. We use a temporal window of $\Delta T=3$s.
  • Figure 4: Impact of the temporal window of slowness learning on different semantic image recognition groups. We compute the average improvement for each semantic group of datasets with respect to $\Delta T=0$s. We use a crop of size $N=224$. Full results are provided in the Appendix (\ref{tab:completetab_t}).
  • Figure 5: Impact of varying the maximal eye movement during a fixation. We report the average improvement for each semantic dataset group relative to the $P=0$ baseline, which corresponds to $\Delta T=0$. All models are trained under the same settings as the bio-inspired models described in Section \ref{sec:objectlearning}.
  • ...and 6 more figures
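
The captions of Figures 4 and 5 refer to a temporal window $\Delta T$ and a maximal eye movement $P$ during a fixation. One possible way these two parameters could constrain the choice of a positive frame is sketched below. This is a hypothetical reading: the function name `sample_positive`, the uniform sampling, and the use of pixels as the unit of $P$ are assumptions not confirmed by this summary.

```python
import numpy as np


def sample_positive(times, gaze, anchor, delta_t, max_shift=None, rng=None):
    """Pick the index of a positive frame for the frame at index `anchor`.

    times:     frame timestamps in seconds, shape (n,)
    gaze:      predicted gaze points, shape (n, 2)
    delta_t:   temporal window (Delta T); candidates lie within +/- delta_t seconds
    max_shift: optional cap (P) on gaze displacement from the anchor.
               Assumption: measured in pixels.
    """
    if rng is None:
        rng = np.random.default_rng()
    times = np.asarray(times, dtype=float)
    gaze = np.asarray(gaze, dtype=float)
    # Temporal constraint: stay inside the slowness window.
    ok = np.abs(times - times[anchor]) <= delta_t
    if max_shift is not None:
        # Fixation constraint: gaze must not move further than max_shift.
        shift = np.linalg.norm(gaze - gaze[anchor], axis=1)
        ok &= shift <= max_shift
    # The anchor always satisfies both constraints, so this is never empty.
    candidates = np.flatnonzero(ok)
    return int(rng.choice(candidates))
```

With `delta_t=0` the only candidate is the anchor frame itself, so the positive pair degenerates to two views of the same frame; a larger window with a small `max_shift` restricts positives to frames where the gaze stays roughly in place.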