On the Generalization for Transfer Learning: An Information-Theoretic Analysis

Xuetong Wu; Jonathan H. Manton; Uwe Aickelin; Jingge Zhu

TL;DR

This work addresses generalization under domain shift in transfer learning by developing information-theoretic bounds that couple the training-data–hypothesis dependence, via mutual information $I(W;Z_i)$, with domain divergence measured by $D(\mu\|\mu')$. It derives ERM, SGD-like, and Gibbs-algorithm bounds, tightens slow-rate results with $(\eta,c)$-central and related conditions to achieve fast rates, and extends the framework to $\phi$-divergences and Wasserstein distances to handle non-absolutely-continuous shifts. A key practical contribution is InfoBoost, an adaptive reweighting algorithm that leverages information measures to balance source and target data for improved transfer performance, demonstrated on synthetic tasks and real datasets. Overall, the paper provides a unified theory and practical tools for analyzing and improving transfer learning under distribution shift, with implications for algorithm design and robust domain adaptation. The results highlight the trade-offs between data quantity in target versus source domains and offer principled guidance for achieving faster convergence and tighter guarantees in transfer learning applications.

Abstract

Transfer learning, or domain adaptation, is concerned with machine learning problems in which training and testing data come from possibly different probability distributions. In this work, we give an information-theoretic analysis of the generalization error and excess risk of transfer learning algorithms. Our results suggest, perhaps as expected, that the Kullback-Leibler (KL) divergence $D(\mu\|\mu')$ plays an important role in the characterizations, where $\mu$ and $\mu'$ denote the distributions of the training data and the testing data, respectively. Specifically, we provide generalization error and excess risk upper bounds for learning algorithms where data from both distributions are available in the training phase. Recognizing that the bounds could be sub-optimal in general, we provide improved excess risk upper bounds for a certain class of algorithms, including the empirical risk minimization (ERM) algorithm, by making stronger assumptions through the central condition. To demonstrate the usefulness of the bounds, we further extend the analysis to the Gibbs algorithm and the noisy stochastic gradient descent method. We then generalize the mutual information bound to other divergences such as the $\phi$-divergence and the Wasserstein distance, which may lead to tighter bounds and can handle the case when $\mu$ is not absolutely continuous with respect to $\mu'$. Several numerical results are provided to demonstrate our theoretical findings. Lastly, because the bounds are often not directly applicable in practice when the data distributions are unknown, we develop an algorithm (called InfoBoost) that dynamically adjusts the importance weights for both source and target data based on certain information measures. The empirical results show the effectiveness of the proposed algorithm.
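To make the role of $D(\mu\|\mu')$ concrete, the following is a minimal sketch that evaluates the KL divergence for the simplest possible case, a pair of Bernoulli distributions. Treating $\mu$ and $\mu'$ as Bernoulli is an assumption made here for illustration, and the function name `bernoulli_kl` is hypothetical. The sketch also returns infinity when $\mu$ places mass where $\mu'$ has none, which is the non-absolutely-continuous case that motivates the $\phi$-divergence and Wasserstein extensions mentioned in the abstract.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence D(Bern(p) || Bern(q)) in nats.

    Illustrative helper (not from the paper): p plays the role of the
    source distribution mu, q the role of the target distribution mu'.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("p and q must lie in [0, 1]")
    total = 0.0
    # Sum over the two outcomes: 1 (prob. p vs q) and 0 (prob. 1-p vs 1-q).
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a == 0.0:
            continue  # 0 * log(0 / b) is taken to be 0 by convention
        if b == 0.0:
            return math.inf  # mu is not absolutely continuous w.r.t. mu'
        total += a * math.log(a / b)
    return total


if __name__ == "__main__":
    # Illustrative inputs only: a strongly shifted pair of distributions.
    print(bernoulli_kl(0.9, 0.1))
    # No shift at all gives zero divergence.
    print(bernoulli_kl(0.3, 0.3))
    # Target assigns zero mass to an outcome the source can produce.
    print(bernoulli_kl(0.5, 0.0))
```

For the first call the closed form is $0.9\ln 9 + 0.1\ln\tfrac{1}{9} = 0.8\ln 9 \approx 1.758$ nats, so a divergence term of this size would dominate any bound unless enough target data is available to offset it.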

Paper Structure (39 sections, 26 theorems, 201 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Literature Review
Information-theoretic analysis for machine learning
Improvements on information-theoretic bounds
Transfer learning bounds and comparisons
Problem formulation and main results
Empirical risk minimization
Upper bound on the generalization error
Upper bound on the excess risk of ERM
Fast rate upper bound on the excess risk of ERM
Applications and Extensions
Generalization error of stochastic noisy iterative algorithms
Generalization error on Gibbs algorithm
Bounding with other divergences
$\phi$-divergence bounds
...and 24 more sections

Key Result

Theorem 1

Assume that the hypothesis $W$ is distributed according to $P_{W}$, induced by some algorithm, and that the cumulant generating function of the random variable $\ell(W, Z)-\mathbb E\left[\ell(W,Z)\right]$ is upper bounded by $\psi(\lambda)$ in the interval $(b_{-},b_{+})$ under the product distribution $P_W\otimes$ … where we define …

Figures (7)

Figure 1: Comparison of the true generalization error (blue), the generalization error bound (green), the true excess risk (orange), and the excess risk bound (red). The parameters $\alpha=0.5, p^{\prime}=w^{*}=0.1, p=0.9, w_{\mathrm{ERM}}=0.5, T=100$, $K_{S T}(0)=10, \eta(0)=0.1, \sigma_{t}=\sqrt{\theta \eta(t) / t}, \theta=0.001, W(0)=0.3$ and $\delta=0.01$ are fixed for all experiments. For the comparison tests, we set $\beta=0.5, W(0)=0.3$ for the first row, $n=1000, W(0)=0.3$ for the second row, and $\beta=0.5, n=1000$ for the last row.
Figure 2: The source data $x_i$ are sampled from the truncated Gaussian distribution $\mathcal{N}_{tc} \sim (\mathbf{0},2\mathbf{I})$, while the target data are sampled from the truncated Gaussian distribution $\mathcal{N}_{tc} \sim ((-2,2),\mathbf{I})$. The corresponding label $y \in \{0, 1 \}$ is generated from the Bernoulli distribution with probability $p(1) = \frac{1}{1+e^{-w^Tx}}$, where $w_s = (0.5,-1)$ for the source and $w_t = (-0.5,1.5)$ for the target.
Figure 3: Comparison of generalization error and excess risk, where we fix $n_s = 10000$ and vary $n_t$ by setting $\alpha = \beta$.
Figure 4: We show the true expected generalization error in (a) along with its bounds in Theorem \ref{thm:excess} and Theorem \ref{thm:central-transfer}. Here we vary $n$ from 50 to 400. To show the convergence up to the domain divergence, we also plot the quantity $\mathbb{E}_W[R_{\mu'}(W)] - D(\mu\|\mu')$ and the fast rate bound $\frac{1}{c\eta n}\sum_{i=1}^{n}I(W;Z_i)$, along with their reciprocals to show the rate w.r.t. the sample size $n$. All results are obtained from 2000 repeated experiments.
Figure 5: Effect of $\zeta$ varying from 0 to 10 when fixing $\eta = 1$. The results are averaged over 20 experiments.
...and 2 more figures
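
The synthetic setup described in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: the caption does not give the truncation region, so the sketch keeps each coordinate within a hypothetical `trunc` of the mean, and the sample sizes and function names are likewise placeholders. Means, variances, and the weight vectors $w_s$ and $w_t$ are taken from the caption.

```python
import math
import random


def sample_truncated_gaussian(rng, mean, var, trunc=4.0):
    """Rejection-sample one point from N(mean, var * I) restricted to a box.

    The box half-width `trunc` is an assumption; the caption does not
    state how the Gaussian is truncated.
    """
    std = math.sqrt(var)
    while True:
        x = tuple(rng.gauss(m, std) for m in mean)
        if all(abs(xi - m) <= trunc for xi, m in zip(x, mean)):
            return x


def sample_label(rng, x, w):
    """Draw y in {0, 1} with P(y = 1) = 1 / (1 + exp(-w^T x))."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    p1 = 1.0 / (1.0 + math.exp(-score))
    return 1 if rng.random() < p1 else 0


def make_domain(rng, mean, var, w, n):
    """Generate n labelled points for one domain (source or target)."""
    xs = [sample_truncated_gaussian(rng, mean, var) for _ in range(n)]
    ys = [sample_label(rng, x, w) for x in xs]
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(0)
    # Source: mean (0, 0), covariance 2I, w_s = (0.5, -1).
    x_s, y_s = make_domain(rng, (0.0, 0.0), 2.0, (0.5, -1.0), n=1000)
    # Target: mean (-2, 2), covariance I, w_t = (-0.5, 1.5).
    x_t, y_t = make_domain(rng, (-2.0, 2.0), 1.0, (-0.5, 1.5), n=100)
    print(len(x_s), len(x_t), sum(y_s) / len(y_s), sum(y_t) / len(y_t))
```

Both the covariate distribution and the labelling rule differ between the two domains in this setup, so a model fitted on source data alone is biased on the target, which is the situation the source and target weighting is meant to address.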

Theorems & Definitions (71)

Theorem 1: Generalization error of generic algorithms
Remark 1
Corollary 1: Generalization error with source only
Corollary 2: Generalization error for subgaussian loss functions
Remark 2
Remark 3
Theorem 2: Excess risk of ERM
Example 1: Estimating the mean of Gaussian
Definition 1: $(\eta,c)$-Central Condition
Example 2
...and 61 more
