BeST -- A Novel Source Selection Metric for Transfer Learning

Ashutosh Soni, Peizhong Ju, Atilla Eryilmaz, Ness B. Shroff

TL;DR

BeST tackles the challenge of selecting the most transferable pre-trained source model for a target task with limited data. It introduces a quantization-based task-similarity metric that operates on the source model outputs in a black-box setting, and identifies an optimal quantization level $q^*$ via ternary search to estimate transferability without training. The method yields a score $M$ that correlates strongly with actual transfer performance and delivers substantial runtime savings over full transfer training across MNIST, CIFAR10, and Imagenette experiments, while remaining architecture-indifferent. The work demonstrates robust ranking fidelity, but notes scalability challenges for multiclass settings, pointing to future work to improve scalability and generalization.
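
The search over the quantization level can be illustrated with a generic integer ternary search. This is a minimal sketch, not the authors' implementation: it assumes the score being maximized (for example, a validation accuracy as a function of $q$) is strictly unimodal over the search range. The names `ternary_search_max` and `toy_score`, and the bounds used, are hypothetical.

```python
from typing import Callable


def ternary_search_max(score: Callable[[int], float], lo: int, hi: int) -> int:
    """Return the integer in [lo, hi] that maximizes a strictly unimodal score."""
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1 = lo + third
        m2 = hi - third
        if score(m1) < score(m2):
            # The maximizer lies strictly to the right of m1.
            lo = m1 + 1
        else:
            # The maximizer lies at or to the left of m2.
            hi = m2
    # At most three candidates remain; compare them directly.
    return max(range(lo, hi + 1), key=score)


def toy_score(q: int) -> float:
    """Hypothetical stand-in for a validation-accuracy curve peaking at q = 7."""
    return -float((q - 7) ** 2)


q_star = ternary_search_max(toy_score, 1, 50)
```

Each iteration discards roughly a third of the range, so the number of score evaluations grows logarithmically with the range size rather than linearly. If the real curve has plateaus or several local maxima, this sketch may return a local rather than a global maximizer.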

Abstract

One of the most fundamental, and yet relatively less explored, goals in transfer learning is the efficient means of selecting top candidates from a large number of previously trained models (optimized for various "source" tasks) that would perform the best for a new "target" task with a limited amount of data. In this paper, we undertake this goal by developing a novel task-similarity metric (BeST) and an associated method that consistently performs well in identifying the most transferable source(s) for a given task. In particular, our design employs an innovative quantization-level optimization procedure in the context of classification tasks that yields a measure of similarity between a source model and the given target data. The procedure uses a concept similar to early stopping (usually implemented to train deep neural networks (DNNs) to ensure generalization) to derive a function that approximates the transfer learning mapping without training. The advantage of our metric is that it can be quickly computed to identify the top candidate(s) for a given target task before a computationally intensive transfer operation (typically using DNNs) can be implemented between the selected source and the target task. As such, our metric can provide significant computational savings for transfer learning from a selection of a large number of possible source models. Through extensive experimental evaluations, we establish that our metric performs well over different datasets and varying numbers of data samples.

Paper Structure

This paper contains 25 sections, 1 theorem, 20 equations, 16 figures, 11 tables, 1 algorithm.

Key Result

Theorem 4.1

Given that the source and target models are binary classifiers and the source softmax output is represented as a random vector $\textbf{p}=[p_1, p_2]$, if true conditional probability distributions $f_1=f_{(p_2|Y=1)}$ and $f_2=f_{(p_2|Y=2)}$ are bounded, then as $q \rightarrow \infty$, $E[A^{val}(\b

Figures (16)

  • Figure 1: Transfer Learning architecture as concatenation of black-box source with a custom model.
  • Figure 2: Quantization function explained through an example of a 3-class source model and q=3.
  • Figure 3: Policy $\boldsymbol{\pi}^q_*$ explained through example with source and target models as binary classifiers.
  • Figure 4: Train-validation accuracy tradeoff where source and target tasks are binary classifiers. Tar=(1,2) and Src=(2,8) denote that the target and source tasks are to classify images of classes indexed 1 and 2, and 2 and 8 of the respective dataset (MNIST or CIFAR10).
  • Figure 5: Comparison of ranks predicted by metric and ground truth for 3-class source to 2-class target transfer in MNIST-MNIST TL setup with $\sim$ 500 data samples using 5-layer custom model.
  • ...and 11 more figures
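
The captions for Figures 2 and 4 refer to a quantization function and a train-validation accuracy tradeoff. The sketch below shows one way such a pipeline could look; it is an illustration under assumptions, not the paper's definition. It assumes equal-width binning of each softmax coordinate into $q$ levels and a majority-vote rule that assigns a target label to each quantized cell. The function names `quantize`, `fit_cell_labels`, and `accuracy`, and the toy data, are all hypothetical.

```python
from collections import Counter, defaultdict


def quantize(p, q):
    """Map each probability in p to one of q equal-width bins, indexed 0..q-1."""
    return tuple(min(int(x * q), q - 1) for x in p)


def fit_cell_labels(outputs, labels, q):
    """Assign to each occupied quantized cell its most frequent target label."""
    votes = defaultdict(Counter)
    for p, y in zip(outputs, labels):
        votes[quantize(p, q)][y] += 1
    return {cell: counts.most_common(1)[0][0] for cell, counts in votes.items()}


def accuracy(cell_labels, outputs, labels, q, default):
    """Fraction of samples whose cell label matches the true target label.

    Cells never seen when fitting fall back to the `default` label.
    """
    hits = sum(
        cell_labels.get(quantize(p, q), default) == y
        for p, y in zip(outputs, labels)
    )
    return hits / len(labels)


# Toy 3-class source softmax outputs with 2-class target labels (made-up data).
train_out = [(0.9, 0.05, 0.05), (0.8, 0.1, 0.1), (0.1, 0.8, 0.1), (0.2, 0.1, 0.7)]
train_lab = [1, 1, 2, 2]
val_out = [(0.75, 0.15, 0.1)]
val_lab = [1]

cells = fit_cell_labels(train_out, train_lab, q=3)
val_acc = accuracy(cells, val_out, val_lab, q=3, default=1)
```

Under this sketch, a larger `q` gives finer cells that fit the training samples more closely but leaves more validation samples in unseen cells, which is one way a train-validation tradeoff in `q` can arise.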

Theorems & Definitions (3)

  • Theorem 4.1
  • proof
  • Remark