On the Generalization for Transfer Learning: An Information-Theoretic Analysis
Xuetong Wu, Jonathan H. Manton, Uwe Aickelin, Jingge Zhu
TL;DR
This work studies generalization under domain shift in transfer learning. It develops information-theoretic bounds that combine the dependence between the training data and the learned hypothesis, measured by the mutual information $I(W;Z_i)$, with the divergence between domains, measured by $D(\mu\|\mu')$. The paper derives bounds for ERM, SGD-like algorithms, and the Gibbs algorithm, tightens the slow-rate results under the $(\eta,c)$-central condition and related conditions to obtain fast rates, and extends the framework to $\phi$-divergences and Wasserstein distances, which cover shifts where absolute continuity fails. On the practical side, it introduces InfoBoost, an adaptive reweighting algorithm that uses information measures to balance source and target data, and evaluates it on synthetic tasks and real datasets. The results also highlight the trade-off between the amounts of target and source data.
Abstract
Transfer learning, or domain adaptation, is concerned with machine learning problems in which training and testing data come from possibly different probability distributions. In this work, we give an information-theoretic analysis of the generalization error and excess risk of transfer learning algorithms. Our results suggest, perhaps as expected, that the Kullback-Leibler (KL) divergence $D(\mu\|\mu')$ plays an important role in the characterizations, where $\mu$ and $\mu'$ denote the distributions of the training data and the testing data, respectively. Specifically, we provide generalization error and excess risk upper bounds for learning algorithms where data from both distributions are available in the training phase. Recognizing that the bounds could be sub-optimal in general, we provide improved excess risk upper bounds for a certain class of algorithms, including the empirical risk minimization (ERM) algorithm, by making stronger assumptions through the central condition. To demonstrate the usefulness of the bounds, we further extend the analysis to the Gibbs algorithm and the noisy stochastic gradient descent method. We then generalize the mutual information bound to other divergences such as the $\phi$-divergence and the Wasserstein distance, which may lead to tighter bounds and can handle the case when $\mu$ is not absolutely continuous with respect to $\mu'$. Several numerical results are provided to demonstrate our theoretical findings. Lastly, to address the problem that the bounds are often not directly applicable in practice due to the absence of distributional knowledge about the data, we develop an algorithm (called InfoBoost) that dynamically adjusts the importance weights for both source and target data based on certain information measures. The empirical results show the effectiveness of the proposed algorithm.
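To illustrate why moving from the KL divergence to the Wasserstein distance can matter, the following minimal sketch compares the two on small discrete distributions. It is not taken from the paper: the three-point support, the probability vectors, and the helper names `kl_divergence` and `wasserstein_1d` are all assumptions chosen for illustration. The sketch shows that $D(\mu\|\mu')$ becomes infinite as soon as $\mu$ places mass where $\mu'$ has none, while the 1-Wasserstein distance between the same pair stays finite.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in nats for two discrete distributions on a shared support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # p is not absolutely continuous w.r.t. q
        total += pi * math.log(pi / qi)
    return total

def wasserstein_1d(points, p, q):
    """1-Wasserstein distance on sorted real points: integral of |CDF_p - CDF_q|."""
    dist, cdf_gap = 0.0, 0.0
    for k in range(len(points) - 1):
        cdf_gap += p[k] - q[k]
        dist += abs(cdf_gap) * (points[k + 1] - points[k])
    return dist

# Hypothetical toy distributions on the support {0, 1, 2}.
points = [0.0, 1.0, 2.0]
mu = [0.5, 0.5, 0.0]               # plays the role of mu
mu_prime_full = [0.5, 0.25, 0.25]  # covers the support of mu
mu_prime_narrow = [1.0, 0.0, 0.0]  # misses the point 1, where mu has mass

kl_full = kl_divergence(mu, mu_prime_full)      # finite
kl_narrow = kl_divergence(mu, mu_prime_narrow)  # infinite
w1_narrow = wasserstein_1d(points, mu, mu_prime_narrow)  # still finite

print(kl_full, kl_narrow, w1_narrow)
```

In the second pair, any bound that depends on $D(\mu\|\mu')$ is vacuous, whereas a bound stated in terms of the Wasserstein distance can remain informative. This matches the motivation given in the abstract for the extension beyond KL.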
