Masked Image Modeling as a Framework for Self-Supervised Learning across Eye Movements
Robin Weiler, Matthias Brucklacher, Cyriel M. A. Pennartz, Sander M. Bohté
TL;DR
This paper investigates masked image modeling (MIM) as a brain-inspired self-supervised learning framework that leverages eye movements and focused foveal vision. By comparing masking strategies, data augmentations, and foreground versus background loss, the authors show that peripheral masking, combined with predictive reconstruction, yields strong, decorrelated latent representations and competitive linear readout accuracy. They demonstrate generalization to different architectures and reveal that latent space decorrelation arises without explicit regularization, linking MIM to latent-regularization approaches. The work provides a biologically plausible lens on MIM, highlights the role of data augmentation, and suggests future directions involving sequential masked viewing and covert attention, with code available for replication.
Abstract
To make sense of their surroundings, intelligent systems must transform complex sensory inputs into structured codes that are reduced to task-relevant information such as object category. Biological agents achieve this in a largely autonomous manner, presumably via self-supervised learning. Whereas previous attempts to model the underlying mechanisms were largely discriminative in nature, there is ample evidence that the brain employs a generative model of the world. Here, we propose that eye movements, in combination with the focused nature of primate vision, constitute a generative, self-supervised task of predicting and revealing visual information. We construct a proof-of-principle model starting from the framework of masked image modeling (MIM), a common approach in deep representation learning. To do so, we analyze how core components of MIM such as masking technique and data augmentation influence the formation of category-specific representations. This allows us not only to better understand the principles behind MIM, but also to reassemble a MIM more in line with the focused nature of biological perception. We find that MIM disentangles neurons in latent space without explicit regularization, a property that has been suggested to structure visual representations in primates. Together with previous findings of invariance learning, this highlights an interesting connection of MIM to latent regularization approaches for self-supervised learning. The source code is available at https://github.com/RobinWeiler/FocusMIM
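
To make two ideas from the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that peripheral masking hides everything outside a circular "foveal" window around a fixation point, that the reconstruction loss is a mean squared error over the hidden pixels only, and that decorrelation of the latent space can be summarized by the mean absolute off-diagonal entry of the correlation matrix of latent activations. All function names and parameters (`peripheral_mask`, `masked_reconstruction_loss`, `mean_abs_offdiag_correlation`, `center`, `radius`) are hypothetical and are not taken from the FocusMIM repository.

```python
import numpy as np


def peripheral_mask(height, width, center, radius):
    """Return a boolean (height, width) mask that is True for hidden pixels.

    Assumption: the visible 'foveal' region is a disc of the given radius
    around `center` (row, col); everything outside it is masked.
    """
    ys, xs = np.ogrid[:height, :width]
    cy, cx = center
    dist_sq = (ys - cy) ** 2 + (xs - cx) ** 2
    return dist_sq > radius ** 2


def masked_reconstruction_loss(image, reconstruction, mask):
    """Mean squared error computed over the masked (hidden) pixels only.

    `image` and `reconstruction` have shape (H, W) or (H, W, C);
    `mask` has shape (H, W) and is True where pixels were hidden.
    """
    squared_error = (image - reconstruction) ** 2
    return squared_error[mask].mean()


def mean_abs_offdiag_correlation(latents):
    """Summarize how correlated latent units are across samples.

    `latents` has shape (n_samples, n_units). Values near 0 indicate
    decorrelated units; values near 1 indicate redundant units.
    """
    corr = np.corrcoef(latents, rowvar=False)
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return np.abs(off_diagonal).mean()
```

In a hypothetical training loop, one would build a mask per simulated fixation with `peripheral_mask`, feed the encoder only the visible pixels, score the decoder output with `masked_reconstruction_loss`, and track `mean_abs_offdiag_correlation` on encoder outputs to check whether latent units decorrelate without an explicit regularizer.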
