Robust Weight Imprinting: Insights from Neural Collapse and Proxy-Based Aggregation

Justus Westerhoff, Golzar Atefi, Mario Koddenbrock, Alexei Figueroa, Alexander Löser, Erik Rodner, Felix A. Gers

TL;DR

This work investigates how to efficiently adapt foundation models to unseen tasks without gradient-based fine-tuning by introducing the IMPRINT framework, which decomposes imprinting into generation, normalization, and aggregation. It shows that using multiple proxies per class via k-means, paired with L2 normalization and max aggregation, yields robust improvements, especially in low-data and less-collapsed regimes. A key link is established between neural collapse (NC1) and the effectiveness of multi-proxy imprinting, providing a principled criterion for proxy design. The study offers extensive experiments across CNN and Transformer FMs on multiple image datasets and releases code for replication and further research.
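The configuration named above (k-means proxies, L2 normalization, max aggregation) can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the function names (`l2_normalize`, `kmeans`, `imprint`, `predict`) are hypothetical, and the deterministic farthest-first initialization is a simplification chosen here so that the example is reproducible.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def kmeans(x, k, n_iter=20):
    """Plain Lloyd's k-means with farthest-first initialization (an assumption)."""
    k = min(k, len(x))  # cannot have more proxies than samples
    centers = [x[0]]
    for _ in range(1, k):
        # Next center: the sample farthest from all centers chosen so far.
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers)
    for _ in range(n_iter):
        # Assign every sample to its nearest center (squared Euclidean distance).
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster went empty.
        centers = np.stack([
            x[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
    return centers


def imprint(embeddings, labels, k=2):
    """Generation step: k proxies per class, L2-normalized before and after."""
    embeddings = l2_normalize(embeddings)
    proxies, proxy_labels = [], []
    for c in np.unique(labels):
        centers = kmeans(embeddings[labels == c], k)
        proxies.append(l2_normalize(centers))
        proxy_labels.extend([c] * len(centers))
    return np.concatenate(proxies), np.array(proxy_labels)


def predict(embeddings, proxies, proxy_labels):
    """Max aggregation: a class scores as its best proxy, so the predicted
    class is the label of the single most similar proxy."""
    scores = l2_normalize(embeddings) @ proxies.T
    return proxy_labels[scores.argmax(axis=1)]
```

Here `embeddings` stands for the frozen foundation model's outputs on the novel task, with shape `(n_samples, dim)`. Setting `k=1` reduces the sketch to single-proxy mean imprinting, which is the baseline that multiple proxies are meant to improve on when a class is multi-modal.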

Abstract

The capacity of foundation models allows for their application to new, unseen tasks. The adaptation to such tasks is called transfer learning. An efficient transfer learning method that circumvents parameter optimization is imprinting. The conceptual differences between studies on imprinting form the basis of our systematic investigation. In this work, we propose the general IMPRINT framework, identifying three main components: generation, normalization, and aggregation. Through the lens of this framework, we conduct an in-depth analysis and a comparison of the existing methods. Our findings reveal the benefits of representing novel data with multiple proxies in the generation step and show the importance of proper normalization. Beyond an extensive analytical grounding, our framework enables us to propose a novel variant of imprinting which outperforms previous work on transfer learning tasks by 4%. This variant determines proxies through clustering motivated by the neural collapse phenomenon, a connection that we draw for the first time. We publicly release our code at https://github.com/DATEXIS/IMPRINT.
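How strongly a set of embeddings has collapsed is often summarized by an NC1-style statistic that compares within-class scatter to between-class scatter. The sketch below uses one common formulation, $\mathrm{tr}(\Sigma_W \Sigma_B^{\dagger}) / C$; the exact definition used in the paper is not stated in this summary and may differ, and the function name `nc1` is hypothetical.

```python
import numpy as np


def nc1(embeddings, labels):
    """Within-class variability relative to between-class variability.

    Lower values mean the embeddings of each class sit closer to their
    class mean, i.e. the representation is more collapsed.
    """
    classes = np.unique(labels)
    global_mean = embeddings.mean(axis=0)
    dim = embeddings.shape[1]
    s_w = np.zeros((dim, dim))  # within-class covariance
    s_b = np.zeros((dim, dim))  # between-class covariance
    for c in classes:
        x_c = embeddings[labels == c]
        mu_c = x_c.mean(axis=0)
        centered = x_c - mu_c
        s_w += centered.T @ centered / len(embeddings)
        diff = (mu_c - global_mean)[:, None]
        s_b += diff @ diff.T / len(classes)
    # The pseudo-inverse handles the rank deficiency of s_b.
    return np.trace(s_w @ np.linalg.pinv(s_b)) / len(classes)
```

Under this reading, a larger value on a novel task indicates a less-collapsed regime, which is where representing each class with several proxies is reported to help most.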

Paper Structure

This paper contains 60 sections, 4 equations, 19 figures, 3 tables, 1 algorithm.

Figures (19)

  • Figure 1: Overview of our IMPRINT framework. The foundation model FM is frozen and shows neural collapse. The weight generator (GEN) uses training data from a novel task T to consecutively generate one or more weight vectors (proxies) per class $1, \dotsc, C$ in T. In inference, the final output for the test data in T is computed by an aggregation (AGG) mechanism. Embeddings and generated weights are normalized according to NORMpre and NORMpost, respectively. During inference, embeddings are normalized according to NORMinf.
  • Figure 2: Previously studied imprinting strategies are special cases within IMPRINT. We evaluate 12 different classification tasks Ts derived from MNIST, FashionMNIST, and CIFAR-10, each with 10 classes or subsets thereof, and 4 pre-trained models FMs (resnet18, resnet50, vit_b_16, swin_b). The proposed configuration ("Ours") derived from IMPRINT outperforms previous work across FMs and Ts by a large margin with statistical significance, as confirmed by the critical difference (CD) diagram below. Since absolute accuracies vary substantially across tasks and models (as reflected by the large standard deviations (std)), this rank-based aggregation is used for fair comparison. Here, $k=20$ is used, highlighting the gain of using multiple proxies per class. For reference, the gray row reports an oracle method that uses cross-class feature statistics to generate weights (see \ref{ss:optimalweights}). It is not an imprinting method and therefore not directly comparable to the imprinting-based approaches above. Nonetheless, the results indicate that our method substantially narrows the gap between single-proxy mean imprinting and this oracle baseline.
  • Figure 3: Left: After the foundation model FM has been trained, the embeddings of its pre-training data show neural collapse: the classes ($o_1,\dotsc,o_4$) are evenly separated in space, and the embeddings of each class concentrate around their class mean. Right: The embeddings of a novel task with classes $c_1,c_2$ (pink and brown) scatter around the collapsed pre-trained classes (gray).
  • Figure 4: Combining multiple classes into one to create tasks with multi-modal class distributions. Simplified example for "$d$ in 1", $d=1,2,3$, with only six (instead of 100) original classes.
  • Figure 5: Benchmarking GEN mechanisms for $k \le 20$ across FMs and Ts. The best NORM combination is used implicitly for each row, and AGG is fixed to max. The CD diagram shows that k-means weight generation performs significantly better than all other methods.
  • ...and 14 more figures