Keypoint Aware Masked Image Modelling
Madhava Krishna, A V Subramanyam
TL;DR
This work addresses SimMIM's weak linear probing performance by introducing KAMIM, a patch-wise weighting scheme derived from keypoint density computed with FAST (with SIFT and ORB as alternatives). By weighting the reconstruction loss of each patch according to this keypoint-derived weight, KAMIM raises linear probing accuracy on ImageNet-1K with ViT-B from 16.12% to 33.97% and gives a modest improvement in finetuning, while preserving training efficiency. Experiments across datasets, keypoint extractors, and architectures show that KAMIM improves linear probing for larger pretraining datasets and yields representations with contrastive-like properties, including longer attention distances and homogeneous self-attention across layers. The study also analyzes the learned representations through token-level visualization and Fourier analysis, drawing parallels to contrastive learning and revealing attention-collapse-like behavior. The authors provide public code for replication.
Abstract
SimMIM is a widely used method for pretraining vision transformers using masked image modeling. However, despite its success in fine-tuning performance, it has been shown to perform sub-optimally when used for linear probing. We propose an efficient patch-wise weighting derived from keypoint features which captures the local information and provides better context during SimMIM's reconstruction phase. Our method, KAMIM, improves the top-1 linear probing accuracy from 16.12% to 33.97%, and finetuning accuracy from 76.78% to 77.3%, on the ImageNet-1K dataset with a ViT-B trained for the same number of epochs. We conduct extensive testing on different datasets, keypoint extractors, and model architectures and observe that patch-wise weighting augments linear probing performance for larger pretraining datasets. We also analyze the learned representations of a ViT-B trained using KAMIM and observe that they behave similarly to those learned by contrastive learning, with longer attention distances and homogeneous self-attention across layers. Our code is publicly available at https://github.com/madhava20217/KAMIM.
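
The patch-wise weighting described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the exponential weighting form, and the `temperature` parameter are assumptions made for illustration, and the keypoints are taken as a plain list of (x, y) pixel coordinates such as a FAST detector might return. The sketch counts keypoints per non-overlapping patch, maps the counts to weights, and averages the weighted per-patch reconstruction loss over the masked patches only.

```python
import math


def keypoint_counts(keypoints, image_size, patch_size):
    """Count keypoints that fall in each non-overlapping square patch.

    keypoints: iterable of (x, y) pixel coordinates (assumed detector output).
    Returns a grid (list of rows) of integer counts.
    """
    n = image_size // patch_size
    counts = [[0] * n for _ in range(n)]
    for x, y in keypoints:
        row, col = int(y) // patch_size, int(x) // patch_size
        if 0 <= row < n and 0 <= col < n:  # ignore points outside the grid
            counts[row][col] += 1
    return counts


def patch_weights(counts, temperature=1.0):
    """Map counts to weights >= 1.

    Hypothetical form: exp((count / max_count) / temperature); the source
    text does not specify the weighting function.
    """
    peak = max(max(row) for row in counts) or 1  # avoid division by zero
    return [[math.exp(c / peak / temperature) for c in row] for row in counts]


def weighted_masked_loss(patch_losses, weights, mask):
    """Mean of weight * loss over masked patches (mask entry 1 = masked)."""
    total = sum(
        w * l * m
        for w_row, l_row, m_row in zip(weights, patch_losses, mask)
        for w, l, m in zip(w_row, l_row, m_row)
    )
    n_masked = sum(sum(row) for row in mask)
    return total / max(n_masked, 1)
```

In this sketch, patches with no keypoints keep a weight of 1, so the loss reduces to an unweighted masked reconstruction loss when no keypoints are detected.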
