
A linearized framework and a new benchmark for model selection for fine-tuning

Aditya Deshpande, Alessandro Achille, Avinash Ravichandran, Hao Li, Luca Zancato, Charless Fowlkes, Rahul Bhotika, Stefano Soatto, Pietro Perona

TL;DR

The paper addresses how to pre-select the best pre-trained model to fine-tune from a diverse model zoo without performing any training, focusing on low-data transfer. It introduces a linearized approximation to fine-tuning inspired by the Neural Tangent Kernel and derives two baselines, Label-Gradient Correlation (LGC) and Label-Feature Correlation (LFC), from the interaction between target labels and pre-trained gradients or features. A large-scale benchmark with 30 single-domain experts and a multi-domain expert, trained on 8 source datasets and evaluated across many target tasks, demonstrates that model zoos can outperform Imagenet-based fine-tuning, especially in data-scarce regimes. The results show that LGC and particularly LFC correlate strongly with actual fine-tuning performance, enabling fast model selection in few-shot settings that reduces brute-force search and offers practical gains for domain transfer. Overall, the work provides a principled, scalable approach to model reuse and a benchmark to advance model-selection research.
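To make the idea of correlating target labels with pre-trained features concrete, the following is a minimal sketch of a kernel-alignment style score. It is an assumption about the general form of such a score, not necessarily the authors' exact definition of LFC: the function name `label_feature_correlation`, the mean-centering, and the choice of a same-label/different-label kernel for multi-class labels are all illustrative.

```python
import numpy as np

def label_feature_correlation(features, labels):
    """Cosine similarity between a centered feature kernel and a label kernel.

    features: (N, D) array of features from a pre-trained model (hypothetical input).
    labels:   (N,) array of integer class labels of the target task.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Feature kernel: inner products between all pairs of samples.
    k_feat = features @ features.T
    # Label kernel (assumed form): +1 if two samples share a label, -1 otherwise.
    k_lab = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
    # Center both kernels by subtracting their mean entry.
    k_feat = k_feat - k_feat.mean()
    k_lab = k_lab - k_lab.mean()
    # Frobenius norms of the two centered kernels.
    denom = np.linalg.norm(k_feat) * np.linalg.norm(k_lab)
    if denom == 0.0:
        return 0.0
    return float((k_feat * k_lab).sum() / denom)
```

Under this sketch, a model whose features cluster by target label scores higher than one whose features ignore the labels, and candidate models could be ranked by the score without any fine-tuning.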

Abstract

Fine-tuning from a collection of models pre-trained on different domains (a "model zoo") is emerging as a technique to improve test accuracy in the low-data regime. However, model selection, i.e. how to pre-select the right model to fine-tune from a model zoo without performing any training, remains an open topic. We use a linearized framework to approximate fine-tuning, and introduce two new baselines for model selection -- Label-Gradient and Label-Feature Correlation. Since all model selection algorithms in the literature have been tested on different use-cases and never compared directly, we introduce a new comprehensive benchmark for model selection comprising: i) a model zoo of single and multi-domain models, and ii) many target tasks. Our benchmark highlights the accuracy gain of a model zoo compared to fine-tuning Imagenet models. We show that our model selection baseline can select optimal models to fine-tune in a few selections and has the highest ranking correlation to fine-tuning accuracy compared to existing algorithms.
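The "ranking correlation to fine-tuning accuracy" mentioned above can be illustrated with a rank correlation between selection scores and measured accuracies. The sketch below assumes Spearman's rank correlation, which is one common choice; this excerpt does not state which statistic the benchmark uses. The function name is hypothetical and ties are not handled specially.

```python
import numpy as np

def spearman_rank_correlation(scores, accuracies):
    """Spearman correlation between selection scores and fine-tuning accuracies.

    Assumes no tied values; ties would be ranked by position, not averaged.
    """
    scores = np.asarray(scores, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    # A double argsort converts values into 0-based ranks.
    score_ranks = np.argsort(np.argsort(scores))
    acc_ranks = np.argsort(np.argsort(accuracies))
    # Pearson correlation of the ranks is the Spearman correlation.
    return float(np.corrcoef(score_ranks, acc_ranks)[0, 1])
```

A value close to 1 means the selection score orders the models in the zoo almost exactly as fine-tuning accuracy does.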

Paper Structure

This paper contains 16 sections, 1 theorem, 20 equations, 8 figures, 4 tables.

Key Result

Proposition 1

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ be the target dataset. Assume the task is a binary classification problem with labels $y_i = \pm 1$ (this is to simplify the notation, but a similar result would hold for multi-class classification using one-hot encoding; using the $L_2$ loss is necessary to ...) ... where $f_{w_0}(\mathcal{X})$ denotes the vector containing the output of the network on all the ima...

Figures (8)

  • Figure 1: Fine-tuning using our model zoo can obtain lower test error compared to: $(a)$ using different architectures and $(b)$ hyper-parameter optimization (HPO) of the Imagenet expert. The standard fine-tuning approach entails picking a network architecture pre-trained on Imagenet to fine-tune and performing hyper-parameter optimization (HPO) during fine-tuning. We outperform this strategy by fine-tuning using our model zoo described in Sec. \ref{sec:model_zoo}. We plot test error as a function of the number of per-class samples (i.e. shots) in the dataset. In $(a)$, we compare fine-tuning with our single-domain experts in the model zoo to using different architectures (AlexNet, ResNet-18, ResNet-101, Wide ResNet-101) for fine-tuning. In $(b)$, we show that fine-tuning with our model zoo obtains lower error than performing HPO on the Imagenet pre-trained Resnet-101 [7780459] during fine-tuning. The model zoo lowers the test error, especially in the low-data regime (5, 10 and 20-shot per-class samples of the target task). Since we compare to Imagenet fine-tuning, we exclude Imagenet experts from our model zoo for the above plots.
  • Figure 2: Fine-tuning with the model zoo of single-domain experts. We plot top-1 test error (vertical axis) for fine-tuning with different single-domain models in our model zoo. For every target task (horizontal axis), we have $4$ columns of markers from left to right: $1)$ Imagenet experts in red, $2)$ Densenet-169 experts with pre-train ($\checkmark$) and without pre-train ($\times$), $3)$ Resnet-101 experts with pre-train ($\checkmark$) and without pre-train ($\times$), $4)$ We use a black $\leftarrow$ to highlight models that perform better than the Imagenet expert (i.e. lower error than the first column of Imagenet experts per task). Our observations are the following: $i)$ For the full target task, we observe better accuracy than the Imagenet expert for Magnetic Tile Defects, UC Merced Land Use and iCassava (see black $\leftarrow$). For 20 and 5-shot per-class sampling of the target task, with the model zoo we outperform the Imagenet expert on more datasets, see Oxford Flowers 102, European Flood Depth, Belga Logos and Cub200. Our empirical result on the importance of different pre-trainings of our model zoo experts when training data is limited adds to the growing body of similar results in the existing literature [he2019rethinking, li2020rethinking, zoph2020rethinking]. $ii)$ The accuracy gain over the Imagenet expert is obtained only by fine-tuning a select few models for a given target task, e.g. only one expert for the UC Merced Land Use target task in the Full and 20-shot settings above. Therefore, brute-force fine-tuning with the model zoo leads to wasteful computation. Model selection (Sec. \ref{sec:approach}) picks the best models to fine-tune and avoids brute-force fine-tuning. Figure is best viewed in high-resolution.
  • Figure 3: Fine-tuning with the multi-domain expert for the full target task. We use the same notation as Fig. \ref{fig:finetune_full}. For every target task (horizontal axis), we have $4$ columns corresponding to fine-tuning different models from left to right: $1)$ Imagenet single and multi-domain expert in red, $2)$ Fine-tuning with different domains of the multi-domain expert in green, $3)$ Single-domain Resnet-101 experts in blue, $4)$ We highlight multi-domain experts that obtain lower error than the Imagenet single-domain expert with a black $\leftarrow$. Note, since our multi-domain expert is Resnet-101 based, we only use our Resnet-101 experts for a fair comparison. Our observations are: $i)$ We see gains over the Imagenet expert (both single and multi-domain) by fine-tuning some (not all) domains of the multi-domain expert, for the Magnetic Tile Defects, Oxford Flowers 102, Cucumber and iCassava target tasks. Therefore, it is important to pick the correct domain from the multi-domain expert for fine-tuning. $ii)$ We observe that the variance in error is smaller for fine-tuning with different domains of the multi-domain expert, possibly due to shared parameters across domains. $iii)$ Finally, in some cases, e.g. Oxford Flowers 102 and iCassava, our multi-domain experts outperform both the single-domain experts and the Imagenet experts. Figure is best viewed in high-resolution.
  • Figure 4: Model selection among single-domain experts. The heatmap shows the accuracy gain over the Resnet-101 Imagenet expert obtained by fine-tuning the top-$3$ selected models for different model selection methods (columns) on our target tasks (rows). Higher values of gain are better. Note, for every method we fine-tune all the top-$3$ selected models (with the same hyper-parameters as Sec. \ref{sec:finetune}) and pick the one with the highest accuracy. Model selection performs better than "Worst Gain" and random selection. On average, LFC, LGC and LEEP [nguyen2020leep] outperform Domain Similarity [Cui2018iNatTransfer] and RSA [DwivediR19]. Feature Metrics [ueno2020a] performs better than LFC and LEEP in the high-data regime, but under-performs in the low-data regime.
  • Figure 5: Model selection with the multi-domain expert. The heatmap shows the accuracy gain obtained by fine-tuning the selected domain over fine-tuning the Imagenet domain from the multi-domain expert. We show results for top-$1$ and top-$3$ selections. LFC and LEEP [nguyen2020leep] are close to the best gain, and they outperform Feature Metrics [ueno2020a] and Random.
  • ...and 3 more figures
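
The top-$3$ protocol described in the Figure 4 caption (fine-tune the top-ranked models, keep the most accurate one, and report its gain over the Imagenet expert) can be sketched as follows. This is an illustrative sketch only: the function name `top_k_gain`, the dictionary inputs, and the model names in the usage note are hypothetical, and the accuracies are assumed to have been measured already by fine-tuning.

```python
def top_k_gain(scores, accuracies, baseline_accuracy, k=3):
    """Accuracy gain of the best model among the top-k selected models.

    scores:            dict mapping model name -> model selection score.
    accuracies:        dict mapping model name -> fine-tuning accuracy.
    baseline_accuracy: accuracy of the reference (e.g. Imagenet) expert.
    """
    # Rank models by selection score, highest first, and keep the top k.
    selected = sorted(scores, key=scores.get, reverse=True)[:k]
    # Fine-tune all selected models and keep the most accurate one.
    best_accuracy = max(accuracies[name] for name in selected)
    # Positive values mean the selected model beats the baseline.
    return best_accuracy - baseline_accuracy
```

For example, with hypothetical scores `{"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7}`, the top-$3$ selection is `a`, `d`, `c`, and the reported gain is the best of their three accuracies minus the baseline accuracy.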

Theorems & Definitions (1)

  • Proposition 1