On the Generalization for Transfer Learning: An Information-Theoretic Analysis

Xuetong Wu, Jonathan H. Manton, Uwe Aickelin, Jingge Zhu

TL;DR

This work addresses generalization under domain shift in transfer learning by developing information-theoretic bounds that couple the training-data–hypothesis dependence, via mutual information $I(W;Z_i)$, with domain divergence measured by $D(\mu\|\mu')$. It derives ERM, SGD-like, and Gibbs-algorithm bounds, tightens slow-rate results with $(\eta,c)$-central and related conditions to achieve fast rates, and extends the framework to $\phi$-divergences and Wasserstein distances to handle non-absolutely-continuous shifts. A key practical contribution is InfoBoost, an adaptive reweighting algorithm that leverages information measures to balance source and target data for improved transfer performance, demonstrated on synthetic tasks and real datasets. Overall, the paper provides a unified theory and practical tools for analyzing and improving transfer learning under distribution shift, with implications for algorithm design and robust domain adaptation. The results highlight the trade-offs between data quantity in target versus source domains and offer principled guidance for achieving faster convergence and tighter guarantees in transfer learning applications.

Abstract

Transfer learning, or domain adaptation, is concerned with machine learning problems in which training and testing data come from possibly different probability distributions. In this work, we give an information-theoretic analysis of the generalization error and excess risk of transfer learning algorithms. Our results suggest, perhaps as expected, that the Kullback-Leibler (KL) divergence $D(\mu\|\mu')$ plays an important role in the characterizations, where $\mu$ and $\mu'$ denote the distributions of the training data and the testing data, respectively. Specifically, we provide generalization error and excess risk upper bounds for learning algorithms where data from both distributions are available in the training phase. Recognizing that the bounds could be sub-optimal in general, we provide improved excess risk upper bounds for a certain class of algorithms, including the empirical risk minimization (ERM) algorithm, by making stronger assumptions through the central condition. To demonstrate the usefulness of the bounds, we further extend the analysis to the Gibbs algorithm and the noisy stochastic gradient descent method. We then generalize the mutual information bound to other divergences such as the $\phi$-divergence and the Wasserstein distance, which may lead to tighter bounds and can handle the case when $\mu$ is not absolutely continuous with respect to $\mu'$. Several numerical results are provided to demonstrate our theoretical findings. Lastly, because the bounds are often not directly applicable in practice when the data distributions are unknown, we develop an algorithm (called InfoBoost) that dynamically adjusts the importance weights for both source and target data based on certain information measures. The empirical results show the effectiveness of the proposed algorithm.
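To make the role of $D(\mu\|\mu')$ and the absolute-continuity caveat concrete, the following minimal sketch computes the KL divergence between two discrete distributions. The function name `kl_divergence` and the example distributions are illustrative assumptions, not taken from the paper; the point is only that the divergence becomes infinite as soon as $\mu$ puts mass where $\mu'$ has none, which is the situation the abstract says other divergences such as the Wasserstein distance can handle.

```python
import math


def kl_divergence(mu, mu_prime):
    """Return D(mu || mu') in nats for two discrete distributions.

    Both arguments are sequences of probabilities over the same finite
    alphabet. This helper is a hypothetical illustration, not code from
    the paper.
    """
    if len(mu) != len(mu_prime):
        raise ValueError("distributions must share the same alphabet")
    total = 0.0
    for p, q in zip(mu, mu_prime):
        if p == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if q == 0.0:
            # mu is not absolutely continuous w.r.t. mu': divergence is infinite
            return math.inf
        total += p * math.log(p / q)
    return total


# Assumed example: Bernoulli source and target with mirrored parameters.
source = [0.9, 0.1]
target = [0.1, 0.9]
print(kl_divergence(source, target))          # finite domain divergence
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))  # infinite: supports do not match
```

In the second call the source places mass on an outcome the target never produces, so any bound containing $D(\mu\|\mu')$ is vacuous for that pair.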

Paper Structure (39 sections, 26 theorems, 201 equations, 7 figures, 3 tables, 1 algorithm)

Key Result

Theorem 1

Assume that the hypothesis $W$ is distributed over $P_{W}$ induced by some algorithm, and the cumulant generating function of the random variable $\ell(W, Z)-\mathbb{E}\left[\ell(W,Z)\right]$ is upper bounded by $\psi(\lambda)$ in the interval $(b_{-},b_{+})$ under the product distribution $P_W\otimes\cdots$ ... where we define ...

Figures (7)

  • Figure 1: Comparison of test results for the true generalization error (blue), the generalization error bound (green), the true excess risk (orange), and the excess risk bound (red). The parameters $\alpha=0.5, p^{\prime}=w^{*}=0.1, p=0.9, w_{\mathrm{ERM}}=0.5, T=100$, $K_{S T}(0)=10, \eta(0)=0.1, \sigma_{t}=\sqrt{\theta \eta(t) / t}, \theta=0.001, W(0)=0.3$ and $\delta=0.01$ are fixed for all experiments. For the comparison tests, we set $\beta=0.5, W(0)=0.3$ for the first row, $n=1000, W(0)=0.3$ for the second row, and $\beta=0.5, n=1000$ for the last row.
  • Figure 2: The source data $x_i$ are sampled from the truncated Gaussian distribution $\mathcal{N}_{tc}(\mathbf{0},2\mathbf{I})$, while the target data are sampled from the truncated Gaussian distribution $\mathcal{N}_{tc}((-2,2),\mathbf{I})$. The corresponding label $y \in \{0, 1\}$ is generated from a Bernoulli distribution with probability $p(1) = \frac{1}{1+e^{-w^Tx}}$, where $w$ is $w_s = (0.5,-1)$ for the source and $w_t = (-0.5,1.5)$ for the target.
  • Figure 3: Comparisons for generalization error and excess risk where we fix $n_s = 10000$ and vary $n_t$ by setting $\alpha = \beta$.
  • Figure 4: We show the true expected generalization error in (a) along with its bounds from Theorem \ref{thm:excess} and Theorem \ref{thm:central-transfer}. Here we vary $n$ from 50 to 400. To show the convergence up to the domain divergence, we also plot the quantity $\mathbb{E}_W[R_{\mu'}(W)] - D(\mu\|\mu')$ and the fast rate bound $\frac{1}{c\eta n}\sum_{i=1}^{n}I(W;Z_i)$, along with their reciprocals to show the rate with respect to the sample size $n$. All results are obtained from 2000 repeated experiments.
  • Figure 5: Effect of $\zeta$ varying from 0 to 10 when fixing $\eta = 1$. The results are averaged over 20 experiments.
  • ...and 2 more figures
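The synthetic setup described in the Figure 2 caption can be sketched roughly as follows. The means, variances, and weight vectors come from the caption; everything else is an assumption made for illustration: the function name `sample_domain`, the sample sizes, the seeds, and in particular the truncation rule (per-coordinate rejection beyond a hypothetical `trunc` distance from the mean), since the caption does not state how the Gaussians are truncated.

```python
import math
import random


def sample_domain(n, mean, var, w, trunc=4.0, seed=0):
    """Draw n labelled points (x, y) from one domain.

    x is Gaussian with the given mean and covariance var * I, truncated by
    rejection; y is Bernoulli with P(y = 1) = 1 / (1 + exp(-w^T x)).
    `trunc` is a hypothetical truncation width, not a value from the paper.
    """
    rng = random.Random(seed)
    std = math.sqrt(var)
    data = []
    while len(data) < n:
        x = [rng.gauss(m, std) for m in mean]
        # Rejection step: discard points too far from the mean in any coordinate.
        if any(abs(xi - m) > trunc for xi, m in zip(x, mean)):
            continue
        logit = sum(wi * xi for wi, xi in zip(w, x))
        p1 = 1.0 / (1.0 + math.exp(-logit))  # logistic label probability
        y = 1 if rng.random() < p1 else 0
        data.append((x, y))
    return data


# Source: mean (0, 0), covariance 2I, w_s = (0.5, -1).
source = sample_domain(1000, mean=(0.0, 0.0), var=2.0, w=(0.5, -1.0), seed=1)
# Target: mean (-2, 2), covariance I, w_t = (-0.5, 1.5).
target = sample_domain(100, mean=(-2.0, 2.0), var=1.0, w=(-0.5, 1.5), seed=2)
print(len(source), len(target))
```

Both the input distribution and the labelling rule differ between the two domains, so a model fitted on source data alone faces covariate shift and a change in the conditional label distribution at the same time.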

Theorems & Definitions (71)

  • Theorem 1: Generalization error of generic algorithms
  • Remark 1
  • Corollary 1: Generalization error with source only
  • Corollary 2: Generalization error for subgaussian loss functions
  • Remark 2
  • Remark 3
  • Theorem 2: Excess risk of ERM
  • Example 1: Estimating the mean of Gaussian
  • Definition 1: $(\eta,c)$-Central Condition
  • Example 2
  • ...and 61 more