An empirical study of task and feature correlations in the reuse of pre-trained models

Jama Hussein Mohamud, Willie Brink

TL;DR

This work investigates how task and feature correlations influence the reuse of pre-trained models, framed as Bob reusing part of a network that Alice trained for a different task. Synthetic and real-world experiments show that Bob's accuracy increases with the correlation between his task and Alice's, and that he can perform significantly better than random even at zero correlation because of Alice's choice of network and optimizer. The study introduces a controllable data-generation framework using concatenated images and analyzes layer-wise fine-tuning and attribution to understand where transfer comes from. Real-world case studies with dog-vs-cat and Open Images demonstrate transfer when the tasks are semantically correlated. When task correlation is low, reusing only the lower pre-trained layers is preferable, which suggests choosing how many layers to fine-tune based on task similarity.

Abstract

Pre-trained neural networks are commonly used and reused in the machine learning community. Alice trains a model for a particular task, and a part of her neural network is reused by Bob for a different task, often to great effect. To what can we ascribe Bob's success? This paper introduces an experimental setup through which factors contributing to Bob's empirical success could be studied in silico. As a result, we demonstrate that Bob might just be lucky: his task accuracy increases monotonically with the correlation between his task and Alice's. Even when Bob has provably uncorrelated tasks and input features from Alice's pre-trained network, he can achieve significantly better than random performance due to Alice's choice of network and optimizer. When there is little correlation between tasks, only reusing lower pre-trained layers is preferable, and we hypothesize the converse: that the optimal number of retrained layers is indicative of task and feature correlation. Finally, we show in controlled real-world scenarios that Bob can effectively reuse Alice's pre-trained network if there are semantic correlations between his and Alice's task.

Paper Structure

This paper contains 16 sections, 5 equations, 10 figures, 1 table, 1 algorithm.

Figures (10)

  • Figure 1: Alice's weights ${\bm{w}}^*$ in equation \ref{eq:alice-basic-nn} as a function of $\alpha$. At $\alpha = 0$ Alice recovers ${\bm{w}}^* \propto [1, 0]$, but as $|\alpha| \to 1$ she incorporates more of Bob's feature $x_2$ until ${\bm{w}}^* \propto [1, 1]$ or ${\bm{w}}^* \propto [1, -1]$.
  • Figure 2: Bob's accuracy for Scenario 1 (S1) and Scenario 2 (S2), when he uses Alice's network "as is" with no fine-tuning (Alice's ${\bm{w}}^*$ and $v^*$ in equation \ref{eq:alice-basic-nn}), and when he uses Alice's ${\bm{w}}^*$ and fine-tunes $v$ for his task. $k$ is the number of input features tied to Alice's task and to Bob's task.
  • Figure 3: Two example images ${\bm{x}}$ with their task labels. Both used MNIST for $\mathcal{D}^{\mathrm{left}}$, and (a) MNIST and (b) SVHN for $\mathcal{D}^{\mathrm{right}}$.
  • Figure 4: Bob's accuracy after fine-tuning only the output layer of Alice's networks for his task (equation \ref{eq:output-layer-retrained}), as functions of $\beta$ and different choices of $\mathcal{D}^{\mathrm{left}}$ and $\mathcal{D}^{\mathrm{right}}$.
  • Figure 5: Bob's accuracy after freezing the first $\ell-1$ layers of Alice's pre-trained network and fine-tuning layer $\ell$ to the output (equation \ref{eq:many-layers-retrained}), for the fully-connected and convolutional networks, and for different values of the task correlation parameter $\beta$.
  • ...and 5 more figures
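
The layer-wise protocol described in the Figure 5 caption (freeze the first $\ell-1$ layers, fine-tune layer $\ell$ through the output) can be sketched in plain Python. This is only an illustration of the bookkeeping, not the authors' implementation: the `Layer` class, the `freeze_below` helper, and the layer names are hypothetical, and no training is performed.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    # Hypothetical stand-in for one layer of Alice's pre-trained network.
    name: str
    trainable: bool = True


def freeze_below(layers, ell):
    """Return a copy in which layers 1..ell-1 (1-indexed) are frozen and
    layer ell through the output layer remain trainable."""
    if not 1 <= ell <= len(layers):
        raise ValueError("ell must be between 1 and the number of layers")
    return [
        Layer(layer.name, trainable=(index >= ell))
        for index, layer in enumerate(layers, start=1)
    ]


# Assumed four-layer network; the names are illustrative only.
alice = [Layer("conv1"), Layer("conv2"), Layer("fc1"), Layer("output")]

# Bob keeps Alice's first two layers fixed and fine-tunes from layer 3 onward.
bob = freeze_below(alice, ell=3)
frozen = [layer.name for layer in bob if not layer.trainable]
retrained = [layer.name for layer in bob if layer.trainable]

# Setting ell to the last layer corresponds to fine-tuning only the output layer.
output_only = freeze_below(alice, ell=len(alice))
```

Under this sketch, smaller values of `ell` mean more of Alice's network is fine-tuned, which is the quantity the abstract hypothesizes to be indicative of task and feature correlation.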