Masked Image Modeling as a Framework for Self-Supervised Learning across Eye Movements

Robin Weiler, Matthias Brucklacher, Cyriel M. A. Pennartz, Sander M. Bohté

TL;DR

This paper investigates masked image modeling (MIM) as a brain-inspired self-supervised learning framework that leverages eye movements and focused foveal vision. By comparing masking strategies, data augmentations, and foreground versus background loss, the authors show that peripheral masking, combined with predictive reconstruction, yields strong, decorrelated latent representations and competitive linear readout accuracy. They demonstrate generalization to different architectures and reveal that latent space decorrelation arises without explicit regularization, linking MIM to latent-regularization approaches. The work provides a biologically plausible lens on MIM, highlights the role of data augmentation, and suggests future directions involving sequential masked viewing and covert attention, with code available for replication.

Abstract

To make sense of their surroundings, intelligent systems must transform complex sensory inputs to structured codes that are reduced to task-relevant information such as object category. Biological agents achieve this in a largely autonomous manner, presumably via self-supervised learning. Whereas previous attempts to model the underlying mechanisms were largely discriminative in nature, there is ample evidence that the brain employs a generative model of the world. Here, we propose that eye movements, in combination with the focused nature of primate vision, constitute a generative, self-supervised task of predicting and revealing visual information. We construct a proof-of-principle model starting from the framework of masked image modeling (MIM), a common approach in deep representation learning. To do so, we analyze how core components of MIM such as masking technique and data augmentation influence the formation of category-specific representations. This allows us not only to better understand the principles behind MIM, but to then reassemble a MIM more in line with the focused nature of biological perception. We find that MIM disentangles neurons in latent space without explicit regularization, a property that has been suggested to structure visual representations in primates. Together with previous findings of invariance learning, this highlights an interesting connection of MIM to latent regularization approaches for self-supervised learning. The source code is available under https://github.com/RobinWeiler/FocusMIM
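The decorrelation of latent neurons mentioned in the abstract can be quantified in several ways. The sketch below uses one simple choice, the mean absolute off-diagonal entry of the correlation matrix of latent activations. This metric and the function name `mean_abs_offdiag_correlation` are assumptions made for illustration and are not necessarily the measure used in the paper.

```python
import numpy as np


def mean_abs_offdiag_correlation(latents: np.ndarray) -> float:
    """Average absolute pairwise correlation between latent neurons.

    latents: array of shape (n_samples, n_neurons), one row per input image.
    Assumes every neuron has non-zero variance across samples; a constant
    neuron would produce NaN entries in the correlation matrix.
    """
    # Columns are treated as variables (neurons), rows as observations.
    corr = np.corrcoef(latents, rowvar=False)
    n_neurons = corr.shape[0]
    # Drop the diagonal, which is 1 by construction.
    off_diag = corr[~np.eye(n_neurons, dtype=bool)]
    return float(np.mean(np.abs(off_diag)))
```

Values near 0 indicate largely decorrelated neurons, and values near 1 indicate highly redundant ones. Tracking this number over pretraining epochs would show whether decorrelation emerges without an explicit regularization term.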

Paper Structure (17 sections, 2 equations, 6 figures)

Figures (6)

  • Figure 1: Representation learning through eye movements. (a) Random patch masks used in artificial MIM approaches; gray patches are hidden from the network. (b) In our approach, selective masking is achieved through the inhomogeneous nature of foveal vision. Eye movements, depicted as vectors, reveal previously inaccessible or distorted information that is compared to the prediction $x^{t+\delta}$. (c) After pretraining on the prediction task, the quality of the latent representations is assessed through classification accuracy in linear probing.
  • Figure 2: The masking strategies define the pretraining task. Top row: Exemplary inputs for different pretraining paradigms, with masked areas covered in the image's gray average color. Masked periphery and random patches are shown with a 60% masking ratio. Bottom row: Predictions from the given inputs after pretraining. Right image: Ground truth.
  • Figure 3: Representative presegmentation masks used to investigate how discarding loss signals from the background affects network performance. The masks were obtained with the rembg library, with brightness indicating confidence, and are overlaid onto the respective STL-10 images.
  • Figure 4: Linear readout accuracy quantifies representation quality after pretraining. (a) Influence of masking strategy in comparison to the multi-view approaches CLAPP (Illing et al., 2021) and LPL (Halvagal and Zenke, 2023), which both come in two variants, trained with a local learning rule or with backpropagation. The remaining baselines are pure autoencoding (AE) from full image to full image in the same network as the masked methods, and linear probing conducted directly on the input images. (b) Influence of data augmentation. In the leftmost and rightmost bars, the filled and outlined bars overlap, i.e., classification accuracy is unaffected by augmentation. (c) Incomplete masking by Gaussian blurring, instead of uniform average coloring, drastically decreased representation quality. (d) Masked image modeling implicitly decorrelates latent space neurons. Error bars and shaded regions indicate the standard deviation across five randomly seeded runs.
  • Figure 5: Influence of presegmentation and masking strategy on pretraining efficiency. (a) When using random patches, using presegmentation to weight the reconstruction loss accelerated pretraining. (b) This was not the case when restricting the reconstruction loss to the foreground object. Shaded regions indicate the standard deviation across five randomly seeded runs.
  • ...and 1 more figure
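The two masking strategies described in the Figure 2 caption (masked periphery and random patches, each at a 60% masking ratio, with hidden areas filled with the image's average color) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the square shape of the visible "foveal" window, the patch size of 8 pixels, and all function names are assumptions. The 96×96 image size matches STL-10.

```python
import numpy as np


def periphery_visible(height, width, mask_ratio=0.6, center=None):
    """Boolean map that is True inside a square 'foveal' window (assumed shape)."""
    # Choose the window side so that roughly (1 - mask_ratio) of pixels stay visible.
    side = int(round(np.sqrt((1.0 - mask_ratio) * height * width)))
    side = min(side, height, width)
    cy, cx = center if center is not None else (height // 2, width // 2)
    # Clip so the window stays fully inside the image for off-center fixations.
    top = int(np.clip(cy - side // 2, 0, height - side))
    left = int(np.clip(cx - side // 2, 0, width - side))
    visible = np.zeros((height, width), dtype=bool)
    visible[top:top + side, left:left + side] = True
    return visible


def random_patch_visible(height, width, patch=8, mask_ratio=0.6, rng=None):
    """Boolean map hiding a random subset of non-overlapping square patches.

    Assumes height and width are divisible by `patch`.
    """
    rng = np.random.default_rng() if rng is None else rng
    grid_h, grid_w = height // patch, width // patch
    n_masked = int(round(mask_ratio * grid_h * grid_w))
    hidden = np.zeros(grid_h * grid_w, dtype=bool)
    hidden[rng.choice(grid_h * grid_w, size=n_masked, replace=False)] = True
    visible_grid = ~hidden.reshape(grid_h, grid_w)
    # Upsample the patch grid back to pixel resolution.
    return visible_grid.repeat(patch, axis=0).repeat(patch, axis=1)


def fill_masked(image, visible):
    """Replace hidden pixels of an (H, W, C) image with its per-channel mean color."""
    mean_color = image.mean(axis=(0, 1), keepdims=True)
    return np.where(visible[..., None], image, mean_color)


# Usage on a random stand-in for a 96x96 RGB image.
image = np.random.default_rng(0).random((96, 96, 3))
masked_periphery = fill_masked(image, periphery_visible(96, 96))
masked_patches = fill_masked(image, random_patch_visible(96, 96))
```

Passing a different `center` to `periphery_visible` moves the visible window, which gives a rough analogue of shifting fixation between views of the same image.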