MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations

Benedikt Alkin; Lukas Miklautz; Sepp Hochreiter; Johannes Brandstetter

TL;DR

MIM pre-training yields features that perform poorly off-the-shelf. The paper traces this to the observation that the most abstract representations in MIM models sit in intermediate encoder blocks rather than in the final ones. It introduces MIM-Refiner, a refinement stage applied to pre-trained MIM models that attaches multiple Instance Discrimination (ID) heads to the later encoder blocks and uses Nearest Neighbor Alignment (NNA) to form semantic clusters. The paper reports state-of-the-art linear probing among models pre-trained on ImageNet-1K, together with strong low-shot, clustering, and transfer results across several MIM backbones, including large ViT models. Because refinement takes only a few epochs, the method is positioned as a lightweight way to prepare pre-trained MIM models for diverse downstream tasks.

Abstract

We introduce MIM (Masked Image Modeling)-Refiner, a contrastive learning boost for pre-trained MIM models. MIM-Refiner is motivated by the insight that strong representations within MIM models generally reside in intermediate layers. Accordingly, MIM-Refiner leverages multiple contrastive heads that are connected to different intermediate layers. In each head, a modified nearest neighbor objective constructs semantic clusters that capture semantic information, which improves performance on downstream tasks in both off-the-shelf and fine-tuning settings. The refinement process is short and simple, yet highly effective. Within a few epochs, we refine the features of MIM models from subpar to state-of-the-art, off-the-shelf features. Refining a ViT-H, pre-trained with data2vec 2.0 on ImageNet-1K, sets a new state-of-the-art in linear probing (84.7%) and low-shot classification among models that are pre-trained on ImageNet-1K. MIM-Refiner efficiently combines the advantages of MIM and ID (Instance Discrimination) objectives and compares favorably against previous state-of-the-art SSL models on a variety of benchmarks such as low-shot classification, long-tailed classification, clustering and semantic segmentation.
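The nearest neighbor objective mentioned in the abstract is described in the Figure 4 caption as Nearest Neighbor Alignment (NNA): an anchor is attracted to its nearest neighbor, retrieved from a first-in-first-out queue of samples from previous batches, and repelled from the other samples in the batch. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The InfoNCE-style form of the loss, cosine similarity, the temperature value, the single-view setup, and the names `nna_loss` and `enqueue` are all assumptions made for illustration.

```python
import math
from collections import deque


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nna_loss(batch, queue, temperature=0.2):
    """Hypothetical NNA-style loss (assumed InfoNCE form, assumed temperature).

    Positive: the anchor's nearest neighbor among the queued embeddings.
    Negatives: the other samples in the current batch.
    """
    if not queue:
        raise ValueError("queue must hold embeddings from previous batches")
    batch = [normalize(v) for v in batch]
    bank = [normalize(v) for v in queue]
    losses = []
    for i, anchor in enumerate(batch):
        # Retrieve the nearest neighbor of the anchor from the queue.
        nearest = max(bank, key=lambda q: dot(anchor, q))
        pos = dot(anchor, nearest) / temperature
        # Repel the anchor from every other sample in the batch.
        negs = [dot(anchor, other) / temperature
                for j, other in enumerate(batch) if j != i]
        logits = [pos] + negs
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - pos)
    return sum(losses) / len(losses)


def enqueue(queue, batch):
    """Append the batch; a deque with maxlen drops the oldest entries (FIFO)."""
    queue.extend(batch)


# Toy usage with made-up 2-D embeddings (illustrative values only).
queue = deque(maxlen=4)
enqueue(queue, [[1.0, 0.0], [0.0, 1.0]])
batch = [[0.9, 0.1], [0.1, 0.9]]
loss = nna_loss(batch, queue)
enqueue(queue, batch)
```

In this toy case each anchor has a close neighbor in the queue and a dissimilar negative in the batch, so the loss is small. Whether the queue is updated before or after the loss is computed is a design choice this sketch does not take from the paper.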

Paper Structure (59 sections, 1 equation, 13 figures, 28 tables)

Introduction
Block Types in Masked Image Modeling
Different blocks due to asymmetric encoder-decoder design.
What to learn from these findings?
Method
Experiments
Ablation Study
Low-shot and Feature Evaluations
Cluster Evaluations
Transfer Learning Evaluations
Fine-tuning with Large Amounts of Labels
Discussion & Future Work
Related Work
Pre-training in Computer Vision
Combining MIM and ID
...and 44 more sections

Figures (13)

Figure 1: Linear probing state-of-the-art on ImageNet-1K over the last four years.
Figure 2: Left: Downstream evaluations of pre-trained SSL models. MIM-Refiner effectively combines the advantages of MIM and ID without suffering from their respective disadvantages. Right: Representation quality of SSL methods evaluated via linear probing. MIM-Refiner advances the state-of-the-art on ImageNet-1K pre-training in low- and high-compute regimes.
Figure 3: Our analysis reveals different block types in MIM models. (a) MIM models are asymmetrically designed where the encoder has most of the parameters, as indicated by the percentages in the bars. (b) The representation quality of MAE encoders (measured by $k$-NN accuracy) peaks in the middle blocks before degrading when the later blocks take over parts of the decoder's task. Various MIM models and also a semantic segmentation task follow this pattern (see Appendix \ref{app:mim_knn_per_layer}). (c) The decline in representation quality in later blocks primarily contributes to the degradation in downstream performance. ID methods (represented by iBOT) and our MAE-Refined do not suffer from this issue. (d) Correlation of the relative improvement of reconstruction loss and $k$-NN accuracy per block. The relative improvement is the difference between subsequent blocks divided by the maximum difference over all blocks. Figure (b) and Figure \ref{fig:appendix_mae_loss} in the appendix show the raw $k$-NN accuracies and reconstruction losses from which the relative improvement is obtained. Middle blocks form abstract representations (large improvements in the $k$-NN accuracy, almost no improvement in reconstruction loss), later blocks take over parts of the reconstruction task (decrease in the $k$-NN accuracy, large improvement in the reconstruction loss). Additional details can be found in Appendix \ref{appendix_mae_loss_per_block}.
Figure 4: Left: Comparison of different pre-training schemes. ID uses a single ID head, whereas MIM models use a light-weight decoder to train an encoder. MIM-Refiner attaches multiple ID heads to the later third of the blocks of a pre-trained MIM encoder. Right: Nearest Neighbor Alignment (NNA). An anchor sample is attracted by its NN and simultaneously repelled from other samples in the batch. The NN is retrieved from a first-in-first-out queue of samples from previous batches.
Figure 5: UMAP plots of ViT-L embeddings using all 53 food-related classes of ImageNet. The corresponding $k$-means cluster accuracy (ACC) and class separation measured in silhouette score (SIL) are shown below each plot. The clustering after refinement (b) is visually more condensed and better separated than before refinement (a), with corresponding improvements in ACC and SIL. Mugs (c) does not separate the clusters that well, as shown by the merged clusters in the middle and the lower SIL score. The colors show the 53 ground truth food classes.
...and 8 more figures
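The Figure 3(d) caption defines the relative improvement per block as the difference between subsequent blocks divided by the maximum difference over all blocks. A minimal sketch of that computation follows. It assumes the normalizer is the largest absolute difference and that, for a loss, an improvement means a decrease. The function name `relative_improvement`, the `lower_is_better` flag, and the numbers are hypothetical and are not taken from the paper.

```python
def relative_improvement(values, lower_is_better=False):
    """Per-block improvement, normalized by the largest absolute change.

    values: one metric value per block, in block order.
    lower_is_better: set True for losses, so that a decrease counts as a
    positive improvement (an assumption about the sign convention).
    """
    if len(values) < 2:
        raise ValueError("need values for at least two blocks")
    sign = -1.0 if lower_is_better else 1.0
    # Difference between each block and the block before it.
    diffs = [sign * (curr - prev) for prev, curr in zip(values, values[1:])]
    largest = max(abs(d) for d in diffs)
    if largest == 0:
        raise ValueError("metric is constant across blocks")
    return [d / largest for d in diffs]


# Made-up per-block values that mimic the pattern described in the caption.
knn_accuracy = [10.0, 30.0, 50.0, 45.0]
reconstruction_loss = [1.0, 0.75, 0.75, 0.25]

knn_gain = relative_improvement(knn_accuracy)
loss_gain = relative_improvement(reconstruction_loss, lower_is_better=True)
```

With these illustrative inputs, the last block shows a negative $k$-NN gain alongside the largest reconstruction gain, which is the qualitative pattern the caption attributes to later blocks.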
