Quantifying Dataset Similarity to Guide Transfer Learning

Shudong Sun, Hao Helen Zhang

TL;DR

The paper tackles the problem of predicting when transfer learning will be beneficial by introducing the Cross-Learning Score (CLS), a label-aware, bidirectional measure of dataset similarity based on generalization performance between source and target domains. CLS links to the cosine similarity between decision boundaries in theory (e.g., probit/LDA settings) and is designed to be computationally efficient rather than relying on high-dimensional density estimation. It provides a practical framework that partitions source datasets into positive, ambiguous, and negative transfer zones and extends to encoder–head architectures for modern deep-transfer pipelines. Through extensive synthetic experiments and real-world tests (eICU mortality prediction and canine image classification), CLS reliably predicts transfer outcomes and guides data selection for transfer learning, offering a principled, scalable tool for transferability assessment.

Abstract

Transfer learning has become a cornerstone of modern machine learning, as it can empower models by leveraging knowledge from related domains to improve learning effectiveness. However, transferring from poorly aligned data can harm rather than help performance, making it crucial to determine whether the transfer will be beneficial before implementation. This work aims to address this challenge by proposing an innovative metric to measure dataset similarity and provide quantitative guidance on transferability. In the literature, existing methods largely focus on feature distributions while overlooking label information and predictive relationships, potentially missing critical transferability insights. In contrast, our proposed metric, the Cross-Learning Score (CLS), measures dataset similarity through bidirectional generalization performance between domains. We provide a theoretical justification for CLS by establishing its connection to the cosine similarity between the decision boundaries for the target and source datasets. Computationally, CLS is efficient and fast to compute as it bypasses the problem of expensive distribution estimation for high-dimensional problems. We further introduce a general framework that categorizes source datasets into positive, ambiguous, or negative transfer zones based on their CLS relative to the baseline error, enabling informed decisions. Additionally, we extend this approach to encoder-head architectures in deep learning to better reflect modern transfer pipelines. Extensive experiments on diverse synthetic and real-world tasks demonstrate that CLS can reliably predict whether transfer will improve or degrade performance, offering a principled tool for guiding data selection in transfer learning.
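To make the idea of "bidirectional generalization performance" concrete, the following is a minimal sketch, not the paper's implementation. It assumes one plausible reading of the abstract: train a classifier on the source and test it on the target, do the reverse, and average the two error rates. The helper names (`fit_logistic`, `cross_learning_error`, `transfer_zone`), the use of a target-only hold-out error as the baseline, and the `margin` that defines the ambiguous zone are all hypothetical choices made for illustration.

```python
import numpy as np

def make_task(beta, n, rng, sigma=1.0):
    """Simulate a binary task: Gaussian features, noisy linear labels (illustrative only)."""
    X = rng.normal(size=(n, len(beta)))
    y = (X @ beta + sigma * rng.normal(size=n) > 0).astype(float)
    return X, y

def fit_logistic(X, y, lr=0.5, steps=300):
    """Gradient-descent logistic regression; a stand-in for any classifier."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def error_rate(w, X, y):
    """Misclassification rate of the linear rule sign(X w)."""
    return float(np.mean((X @ w > 0).astype(float) != y))

def cross_learning_error(Xt, yt, Xs, ys):
    """Hypothetical bidirectional score: mean of source->target and target->source errors."""
    w_s, w_t = fit_logistic(Xs, ys), fit_logistic(Xt, yt)
    return 0.5 * (error_rate(w_s, Xt, yt) + error_rate(w_t, Xs, ys))

def transfer_zone(score, baseline_error, margin=0.05):
    """Assumed zoning rule: compare the score with the baseline error, +/- a margin."""
    if score < baseline_error - margin:
        return "positive"
    if score > baseline_error + margin:
        return "negative"
    return "ambiguous"

rng = np.random.default_rng(0)
beta_t = np.array([2.0, 0.0, 0.0, 0.0, 0.0])
Xt, yt = make_task(beta_t, 2000, rng)
Xs_aligned, ys_aligned = make_task(beta_t, 2000, rng)    # same decision boundary
Xs_opposed, ys_opposed = make_task(-beta_t, 2000, rng)   # flipped decision boundary

# Baseline (assumption): target-only model evaluated on held-out target data.
half = len(yt) // 2
baseline = error_rate(fit_logistic(Xt[:half], yt[:half]), Xt[half:], yt[half:])

aligned_score = cross_learning_error(Xt, yt, Xs_aligned, ys_aligned)
opposed_score = cross_learning_error(Xt, yt, Xs_opposed, ys_opposed)
print(aligned_score, transfer_zone(aligned_score, baseline))
print(opposed_score, transfer_zone(opposed_score, baseline))
```

In this toy setup the source with the flipped boundary yields a much larger cross-domain error than the aligned source, which is the kind of signal a label-aware score can pick up even when the feature distributions are identical.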

Paper Structure

This paper contains 36 sections, 3 theorems, 80 equations, 23 figures, 20 tables, and 3 algorithms.

Key Result

Theorem 1

Consider a binary classification problem with $\mathbf{X}\sim\mathcal{N}_p(0,I)$, in which the labels $Y\in\{0,1\}$ of the target and source tasks follow probit regression models where the noise terms $\xi^{(t)}$ and $\xi^{(s)}$ are independent draws from $\mathcal{N}(0,\sigma^2)$. Then the Cross-Learning Score is given by where $\rho_1 = \frac{\beta^{(t)\top} \beta^{(s)}}{\sqrt{\|\beta^{(t)}\|^2 + \sigma^2} \cdot
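The link between cross-domain error and a cosine-like quantity can be illustrated numerically. The sketch below does not reproduce the theorem's formula; it only checks a standard identity for zero-mean bivariate normals, $P(\operatorname{sign}U \neq \operatorname{sign}V) = \arccos(\rho)/\pi$, in a setup chosen for illustration: noisy target labels $\mathbb{1}(\mathbf{X}^\top\beta^{(t)} + \xi > 0)$ predicted by the noiseless source rule $\mathbb{1}(\mathbf{X}^\top\beta^{(s)} > 0)$. The label model, the dimensions, and the correlation `rho` are assumptions derived for this setup and are not claimed to equal the theorem's $\rho_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, sigma = 10, 200_000, 1.0  # illustrative values, not taken from the paper

beta_t = rng.normal(size=p)  # target coefficient vector (hypothetical)
beta_s = rng.normal(size=p)  # source coefficient vector (hypothetical)

# Features X ~ N_p(0, I); target labels come from a noisy linear threshold.
X = rng.normal(size=(n, p))
y_t = (X @ beta_t + sigma * rng.normal(size=n) > 0).astype(int)

# Apply the noiseless source decision rule to the target data.
pred_s = (X @ beta_s > 0).astype(int)
empirical = float(np.mean(pred_s != y_t))

# Correlation between U = X'beta_t + xi and V = X'beta_s under this setup.
rho = beta_t @ beta_s / (np.sqrt(beta_t @ beta_t + sigma**2) * np.linalg.norm(beta_s))
theory = float(np.arccos(rho) / np.pi)

print(f"empirical error: {empirical:.4f}, arccos(rho)/pi: {theory:.4f}")
```

As the angle between the two coefficient vectors shrinks, `rho` approaches its maximum and the cross-domain error falls, which matches the intuition that similar decision boundaries transfer well.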

Figures (23)

  • Figure 1: Illustration of different feature-response relationships between the target and the source data.
  • Figure 2: Illustration of different feature-response relationships between the target and the source data.
  • Figure 3: Illustration of target–source similarity with varying angles between $\beta^{(t)}$ and $\beta^{(s)}$ for Lemma 1.
  • Figure 4: CLS changes with cosine similarity in several classification examples.
  • Figure 5: Comparison of CLS vs other similarity metrics under the LDA setting.
  • ...and 18 more figures

Theorems & Definitions (6)

  • Theorem 1
  • Lemma 1
  • Theorem 2
  • proof
  • proof
  • proof