Self-Masking Networks for Unsupervised Adaptation

Alfonso Taboada Warmerdam; Mathilde Caron; Yuki M. Asano

TL;DR

The paper addresses the adaptation of large pretrained vision models to downstream tasks when labeled data is scarce, while minimizing storage by learning binary masks over the pretrained weights. It introduces Self-Masking Networks (SMNs), which learn a mask $M$ over the weights $\theta$ using a self-supervised loss. Each weight has a score $S_i$ that is compared against a threshold $\mu$, and a normalization factor $\alpha$ rescales the surviving weights to preserve variance: $M_i = \mathbb{I}[S_i > \mu]$, $\alpha = \sqrt{\frac{1}{N} \sum_i \mathbb{I}[S_i > \mu]}$, and $\theta_i' = (\theta_i / \alpha) M_i$. Key contributions include a hyperparameter-free masking design whose training is invariant to shifting the score initialization and the threshold by the same amount, a label-free adaptation strategy based on a SwAV clustering objective, and a model cascade framework that trains multiple expert masks and fuses their embeddings with PCA to improve downstream accuracy under limited supervision. These approaches yield up to 79x more efficient storage and competitive performance across eight datasets and three architectures. The work reports strong results in label-efficient and semi-supervised regimes and shows that cascades provide consistent accuracy gains (e.g., several points in linear probing) while retaining substantial storage advantages, offering a scalable path for deploying foundation models with minimal labeled data.
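The masking rule stated in the TL;DR can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `apply_mask`, the list-based weight representation, and the toy values are assumptions, and raising an error for an all-zero mask is a choice made here to avoid dividing by zero.

```python
import math


def apply_mask(theta, scores, mu=0.0):
    """Apply a binary mask to frozen weights (hypothetical helper).

    theta  : list of pretrained weights (kept frozen)
    scores : list of learned scores, one per weight
    mu     : threshold; a weight survives when its score exceeds mu
    """
    # M_i = I[S_i > mu]
    mask = [1.0 if s > mu else 0.0 for s in scores]
    kept = sum(mask)
    if kept == 0:
        # Assumption of this sketch: an empty mask is treated as an error.
        raise ValueError("mask removed every weight")
    # alpha = sqrt(fraction of weights kept)
    alpha = math.sqrt(kept / len(mask))
    # theta'_i = (theta_i / alpha) * M_i
    masked = [(t / alpha) * m for t, m in zip(theta, mask)]
    return masked, mask, alpha


if __name__ == "__main__":
    weights = [2.0, -4.0, 6.0, 8.0]   # toy values, not from the paper
    scores = [1.0, -1.0, -0.5, -0.2]  # toy values, not from the paper
    masked, mask, alpha = apply_mask(weights, scores, mu=0.0)
    print("mask:", mask)
    print("alpha:", alpha)
    print("masked weights:", masked)
```

Only the binary mask needs to be stored per downstream task, since the pretrained weights stay fixed and $\alpha$ can be recomputed from the mask.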

Abstract

With the advent of billion-parameter foundation models, efficient fine-tuning has become increasingly important for the adaptation of models to downstream tasks. However, especially in computer vision, it can be hard to achieve good performance when access to quality labeled data is lacking. In this work, we propose a method adapting pretrained generalist models in a self-supervised manner by learning binary masks. These self-supervised masking networks (SMNs) are up to 79x more efficient to store and significantly improve performance on label-efficient downstream tasks. We validate the usefulness of learning binary masks as a fine-tuning method on 8 datasets and 3 model architectures, and we demonstrate the effectiveness of SMNs in 3 label-efficient settings.

Paper Structure (41 sections, 2 theorems, 10 equations, 10 figures, 10 tables)

Introduction
Related work
Frozen Network Adaptation.
Masking Neural Networks.
Self-supervised Learning on Restricted Domains.
Method
Background: Network Masking
Pass-through trick for training.
Hyperparameter-free masking
Practical implications
Label-free adaptation
Model Cascades
Combining Embeddings.
Dimensionality Reduction.
Experiments
...and 26 more sections

Key Result

Theorem

Translation invariance of threshold and initialization. Shifting the score initialization $S^0$ and the threshold $\mu$ by an equal amount does not affect SGD-based training without weight decay.
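The statement can be checked on a toy problem. The sketch below is an illustration under stated assumptions rather than the paper's setup: it uses a four-weight linear model with a squared-error loss, plain SGD without weight decay, and a pass-through gradient that treats the derivative of the mask with respect to the scores as 1. It also omits the normalization $\alpha$ for brevity. The function name `train_masks` and all numeric values are hypothetical, chosen so that every intermediate value is exactly representable in floating point.

```python
def train_masks(s0, mu, lr=0.125, steps=20):
    """Run toy mask training and return the mask used at every step."""
    theta = [2.0, -1.0, 3.0, 1.0]  # frozen weights (toy values)
    data = [                        # (input, target) pairs (toy values)
        ([1.0, 2.0, 0.0, 1.0], 3.0),
        ([0.0, 1.0, 1.0, 2.0], 2.0),
        ([2.0, 0.0, 1.0, 0.0], 4.0),
    ]
    scores = [s0] * len(theta)
    history = []
    for step in range(steps):
        x, y = data[step % len(data)]
        # Forward pass with the hard mask M_i = I[S_i > mu].
        mask = [1.0 if s > mu else 0.0 for s in scores]
        history.append(tuple(mask))
        pred = sum(t * m * xi for t, m, xi in zip(theta, mask, x))
        err = pred - y
        # Pass-through gradient (assumption): dL/dS_i := dL/dM_i.
        grads = [err * t * xi for t, xi in zip(theta, x)]
        # Plain SGD step, no weight decay.
        scores = [s - lr * g for s, g in zip(scores, grads)]
    return history


if __name__ == "__main__":
    base = train_masks(s0=1.0, mu=0.0)
    shifted = train_masks(s0=1.5, mu=0.5)         # both shifted by 0.5
    threshold_only = train_masks(s0=1.0, mu=0.5)  # only mu shifted
    print("equal shift gives same masks:", base == shifted)
    print("threshold-only shift gives same masks:", base == threshold_only)
```

In this sketch the gradient depends on the scores only through the mask, and the mask depends only on the difference between each score and the threshold, so an equal shift of both leaves every step unchanged. Adding weight decay on the scores would break this, because the decay term depends on the absolute score values.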

Figures (10)

Figure 1: Conceptual comparison between two adaptation mechanisms: standard full fine-tuning versus self-supervised self-masking (Ours).
Figure 2: Cascade models work by complementing the root model's features with those of expert models which are tailored to specific parts of the training distribution.
Figure 3: Low-shot adaptation with self-supervised self-masking. We report top-1 accuracy after transferring to downstream tasks in a low-shot setting (% of labeled data used). We compare different adaptation techniques: linear probing, full fine-tuning, self-masking with a supervised objective and self-masking in a self-supervised manner. The pretrained network is $\text{ResNet-50}_{\text{SwAV}}$.
Figure 5: Left: Training loss for the standard setting ($\lambda=50$, $S^0=1.0$, $\mu=0.0$) compared with an equivalent but distinct set of hyperparameters ($\lambda=100$, $S^0=2.5$, $\mu=0.5$). Right: Loss curves when only one hyperparameter is changed at a time relative to the standard setting (doubling the learning rate, doubling the score initialization, or shifting the threshold by 0.5). These experiments were run on CIFAR-10 with the supervised masking algorithm.
Figure 6: Sparsity levels found across layers by our masking algorithm, when applied to the $\text{ViT-B/32}_{\text{CLIP}}$ model. Displayed is the average across the datasets CIFAR-10, CIFAR-100, SUN397 and DTD.
...and 5 more figures

Theorems & Definitions (4)

Theorem
Proof
Theorem
Proof
