Understanding the Transferability of Representations via Task-Relatedness

Akshay Mehra, Yunbei Zhang, Jihun Hamm

TL;DR

We analyze the transferability of the representations of pre-trained models to downstream tasks in terms of their relatedness to a given reference task. The analysis yields an upper bound on transferability in terms of task-relatedness, quantified using the differences between the class priors, label sets, and features of the two tasks.

Abstract

The growing popularity of transfer learning, due to the availability of models pre-trained on vast amounts of data, makes it imperative to understand when the knowledge of these pre-trained models can be transferred to obtain high-performing models on downstream target tasks. However, the exact conditions under which transfer learning succeeds in a cross-domain, cross-task setting are still poorly understood. To bridge this gap, we propose a novel analysis of the transferability of the representations of pre-trained models to downstream tasks in terms of their relatedness to a given reference task. Our analysis leads to an upper bound on transferability in terms of task-relatedness, quantified using the differences between the class priors, label sets, and features of the two tasks. Our experiments using state-of-the-art pre-trained models show the effectiveness of task-relatedness in explaining transferability on various vision and language tasks. The efficient computability of task-relatedness, even without labels of the target task, and its high correlation with the model's accuracy after end-to-end fine-tuning on the target task make it a useful metric for transferability estimation. Our empirical results of using task-relatedness to select the best pre-trained model from a model zoo for a target task highlight its utility for practical problems.

Paper Structure

This paper contains 38 sections, 13 theorems, 23 equations, 13 figures, 3 tables, and 1 algorithm.

Key Result

Theorem 1

Let $C := \left[\frac{P_{R'}(y)}{P_{R}(y)}\right]_{y=1}^{K_R}$ be a vector of probability ratios, $B$ be a $K_T\times K_R$ matrix with $B_{ij}=P(y_{R''}=i|y_{R'}=j)$, and $A:\mathcal{Z} \to \mathcal{Z}$ be an invertible linear map of features. Let the classifiers $h_{R'}(z) := h_R(z)$, $h_{R''}(z):=Bh_{R'

Figures (13)

  • Figure 1: Given a pre-trained encoder (e.g., CLIP; Radford et al., 2021), how does the performance after fine-tuning it on a reference task (e.g., ImageNet) relate to the performance after fine-tuning it on other tasks? Through a rigorous bound on transferability (Theorem \ref{theorem:final_bound}) in terms of the relatedness between a reference and a target task, we show that tasks related to the reference task achieve provably better performance after fine-tuning.
  • Figure 2: Overview of our task transformation model: A series of transformations is applied to the reference distribution $P_R(z,y)$ and classifier $h_R$ to produce the transformed distribution $P_{R'''}$ and classifier $h_{R'''}$ to explain transferability to the downstream target task. Class-prior transformation ($R\rightarrow R'$) changes the class prior of the reference distribution (e.g., an irrelevant Bee class in $R$ now has a smaller prior), followed by label set transformation ($R'\rightarrow R''$) (e.g., to match $\{$Lion, Wolf$\}$ with $\{$Cat, Dog$\}$), followed by feature space transformation ($R''\rightarrow R'''$) to match the feature distribution of the target task $P_T(z,y)$.
  • Figure 3: Task-relatedness (decomposed into its components) produces a small gap to transferability (blue bars). As the task-relatedness between the reference task (ImageNet for CV, DBPedia for NLP) and the target tasks (x-axis) increases, the transferability improves. (Note: the label mismatch term is zero in our figures as $B$ is fixed to a sparse matrix, see Sec. \ref{sec:algorithms_task_transfer}.)
  • Figure 4: (a) Task-relatedness and transferability are highly correlated across various reference-target pairs. (b) Improving the transferability of an encoder on a reference task (in the plot title) leads to improved transferability of all related target tasks (x-axis); e.g., compared to the original pre-trained CLIP encoder (PE), an end-to-end fine-tuned CLIP encoder (FFE) on the reference task achieves higher transferability to all related tasks.
  • Figure 5: Task-relatedness (Ours) remains highly correlated with accuracy after end-to-end fine-tuning on a target task even when using a small percentage of target data, unlike other SbTE methods (LogME, LEEP, NCE, PACTran, OT-NCE, OTCE, and H-Score), whose correlation is affected significantly. For LogME, LEEP, NCE, OT-NCE, OTCE, and H-Score, positive correlation is better, whereas for PACTran and task-relatedness (ours), negative correlation is better.
  • ...and 8 more figures
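
The chain of transformations described in the Figure 2 caption, using the quantities $C$, $B$, and $A$ named in Theorem 1, can be illustrated with a small NumPy sketch. This is only a toy illustration under stated assumptions, not the paper's algorithm: the function name `transform_reference`, the array shapes, and the toy numbers are all hypothetical, and reading the truncated classifier definition as $B$ applied to $h_{R'}(z)$ is an assumption. The sketch does not compute the bound itself or learn $B$ and $A$.

```python
import numpy as np

def transform_reference(prior_R, probs_R, feats_R, C, B, A):
    """Toy version of the R -> R' -> R'' -> R''' chain (hypothetical helper).

    prior_R : (K_R,)     class prior of the reference task
    probs_R : (n, K_R)   reference classifier outputs h_R(z), one row per sample
    feats_R : (n, d)     reference features z
    C       : (K_R,)     probability ratios P_{R'}(y) / P_R(y)
    B       : (K_T, K_R) B[i, j] = P(y_{R''} = i | y_{R'} = j)
    A       : (d, d)     invertible linear map on the feature space
    """
    if not np.allclose(B.sum(axis=0), 1.0):
        raise ValueError("each column of B must be a conditional distribution")
    if np.linalg.matrix_rank(A) != A.shape[0]:
        raise ValueError("A must be invertible")

    # Class-prior transformation: P_{R'}(y) = C_y * P_R(y).
    prior_Rp = C * prior_R
    # Label-set transformation: P_{R''}(i) = sum_j B[i, j] * P_{R'}(j).
    prior_Rpp = B @ prior_Rp
    # Assumed reading of the classifier update: h_{R''}(z) = B h_{R'}(z),
    # with h_{R'} = h_R as stated in Theorem 1.
    probs_Rpp = probs_R @ B.T
    # Feature transformation: z -> A z.
    feats_Rppp = feats_R @ A.T
    return prior_Rpp, probs_Rpp, feats_Rppp

# Toy example: 3 reference classes mapped onto 2 target classes.
prior_R = np.array([0.5, 0.3, 0.2])
desired_prior_Rp = np.array([0.45, 0.45, 0.10])  # down-weights the third class
C = desired_prior_Rp / prior_R
B = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])
A = np.array([[2.0, 0.0],
              [0.0, 0.5]])
probs_R = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.6, 0.3]])
feats_R = np.array([[1.0, 2.0],
                    [3.0, 4.0]])

prior_Rpp, probs_Rpp, feats_Rppp = transform_reference(
    prior_R, probs_R, feats_R, C, B, A
)
```

Because each column of `B` sums to one, the transformed prior and the transformed classifier outputs remain valid probability vectors over the target label set.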

Theorems & Definitions (25)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Definition 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Corollary 1
  • ...and 15 more