On the Generalization Ability of Unsupervised Pretraining

Yuyang Deng, Junyuan Hong, Jiayu Zhou, Mehrdad Mahdavi

TL;DR

The paper studies how unsupervised pretraining influences the generalization of fine-tuned models when the pretraining and fine-tuning tasks differ. It introduces a formal framework built on representation transferability and representation-induced Rademacher complexity, yielding a generalization bound that links downstream performance to pretraining quality and distribution mismatch. The authors instantiate the theory on Context Encoder (CE) and Masked Autoencoder (MAE) pretraining, deriving transferability guarantees and CE- and MAE-specific bounds, and propose RadReg, a Rademacher-based regularization method with convergence guarantees that uses unlabeled data to improve downstream generalization. Empirically, RadReg improves fine-tuning performance on MAE pipelines (e.g., CIFAR-10 and STL-10) and accelerates convergence, suggesting practical design principles for more effective unsupervised pretraining and transfer learning.

Abstract

Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization. However, a rigorous understanding of how the representation function learned on an unlabeled dataset affects the generalization of the fine-tuned model is lacking. Existing theoretical research does not adequately account for the heterogeneity of the distributions and tasks in the pre-training and fine-tuning stages. To bridge this gap, this paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase, ultimately affecting the generalization capabilities of the fine-tuned model on downstream tasks. We apply our theoretical framework to analyze the generalization bounds of two distinct scenarios: Context Encoder pre-training with deep neural networks and Masked Autoencoder pre-training with deep transformers, each followed by fine-tuning on a binary classification task. Finally, inspired by our findings, we propose a novel regularization method during pre-training to further enhance the generalization of the fine-tuned model. Overall, our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.

Paper Structure

This paper contains 41 sections, 25 theorems, 177 equations, 1 figure, 2 tables.

Key Result

Theorem 1

Assume $\hat{h}$ and $\hat{g}$ are the pre-trained representation function and its associated decoder function, and that the real-valued non-negative loss $\phi$ is $G_\phi$-Lipschitz and bounded by $B_\phi$. Assume the pre-training and fine-tuning tasks admit $(C_\beta, \beta)$ representation transferability, where $h^*_{\mathcal{U}} = \arg\min_{h \in \mathcal{H}} \min_{g\in\mathcal{G}} \mathcal{L}_{\mathc

Figures (1)

  • Figure 1: Testing and training accuracy per epoch, averaged over three repetitions.

Theorems & Definitions (45)

  • Definition 1: Representation-induced Rademacher complexity
  • Definition 2: Representation transferability
  • Theorem 1
  • Remark 1
  • Lemma 1
  • Theorem 2
  • Lemma 2
  • Lemma 3: Generalization of MAE pre-training task
  • Theorem 3
  • Definition 3: Moreau Envelope
  • ...and 35 more
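
The exact statement of Definition 1 is not reproduced on this page, so the following is only a rough illustration of the underlying idea: it estimates the standard empirical Rademacher complexity of a norm-bounded linear head evaluated on fixed representations. The encoder weights `W`, the norm bound `B`, and both function names are hypothetical choices made for this sketch, not objects taken from the paper.

```python
import numpy as np

def linear_head_rademacher(Z, B=1.0, num_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {z -> <w, z> : ||w||_2 <= B} on fixed representations Z of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    # Rademacher signs: each entry is +1 or -1 with equal probability.
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    # For a norm-bounded linear class the supremum over w has a closed form:
    # sup_w (1/n) sum_i sigma_i <w, z_i> = (B / n) * ||sum_i sigma_i z_i||_2.
    sup_values = B / n * np.linalg.norm(sigma @ Z, axis=1)
    return sup_values.mean()

def jensen_upper_bound(Z, B=1.0):
    """Classical upper bound B * sqrt(sum_i ||z_i||^2) / n (Jensen's inequality)."""
    n = Z.shape[0]
    return B * np.sqrt((Z ** 2).sum()) / n

# Toy data and a hypothetical frozen encoder h(x) = tanh(x W).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
W = rng.normal(size=(10, 5))
Z = np.tanh(X @ W)

estimate = linear_head_rademacher(Z)
bound = jensen_upper_bound(Z)
print(f"Monte Carlo estimate: {estimate:.4f}, upper bound: {bound:.4f}")
```

In this toy setting the complexity depends on the data only through the representations `Z`, so an encoder that produces smaller-norm features yields a smaller value. That dependence is the kind of quantity a Rademacher-based regularizer could plausibly penalize during pretraining, though the paper's actual RadReg objective may differ from this sketch.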