Improving Generalization of Complex Models under Unbounded Loss Using PAC-Bayes Bounds

Xitong Zhang, Avrajit Ghosh, Guangliang Liu, Rongrong Wang

TL;DR

A new PAC-Bayes training algorithm with improved performance and reduced reliance on prior tuning is introduced by establishing a new PAC-Bayes bound for unbounded loss and a theoretically grounded approach that involves jointly training the prior and posterior using the same dataset.

Abstract

Previous research on PAC-Bayes learning theory has focused extensively on establishing tight upper bounds for test errors. A recently proposed training procedure, called PAC-Bayes training, updates the model toward minimizing these bounds. Although this approach is theoretically sound, in practice it has not achieved test errors as low as those obtained by empirical risk minimization (ERM) with carefully tuned regularization hyperparameters. Additionally, existing PAC-Bayes training algorithms often require bounded loss functions and may need a search over priors with additional datasets, which limits their broader applicability. In this paper, we introduce a new PAC-Bayes training algorithm with improved performance and reduced reliance on prior tuning. This is achieved by establishing a new PAC-Bayes bound for unbounded loss and a theoretically grounded approach that involves jointly training the prior and posterior using the same dataset. Our comprehensive evaluations across various classification tasks and neural network architectures demonstrate that the proposed method not only outperforms existing PAC-Bayes training algorithms but also approximately matches the test accuracy of ERM that is optimized by SGD/Adam using various regularization methods with optimal hyperparameters.

Paper Structure

This paper contains 33 sections, 8 theorems, 47 equations, 10 figures, 12 tables, 3 algorithms.

Key Result

Theorem 2.1

(Maurer, 2004) Assume the loss function $\ell$ is bounded within the interval $[0,1]$. Given a preset prior distribution $\mathcal{P}$ over the model space $\mathcal{H}$, and given a scalar $\delta\in (0,1)$, for any choice of i.i.d. $m$-sized training dataset $\mathcal{S}$ drawn according to $\mathcal{D}$, the bound holds with probability at least $1-\delta$. Here, KL stands for the Kullback-Leibler divergence.
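
For reference, a widely cited Pinsker-relaxed form of Maurer's bound for losses in $[0,1]$ says that, with probability at least $1-\delta$, the expected test loss under a posterior $\mathcal{Q}$ is at most the expected training loss plus $\sqrt{(\mathrm{KL}(\mathcal{Q}\|\mathcal{P}) + \log(2\sqrt{m}/\delta))/(2m)}$. The Python sketch below evaluates that right-hand side. It assumes this particular form, which may differ from the exact statement in the paper, and the function and argument names are hypothetical.

```python
import math


def maurer_pinsker_bound(empirical_risk, kl_divergence, m, delta):
    """Evaluate a Pinsker-relaxed Maurer-style PAC-Bayes bound.

    Assumed form (not necessarily the paper's exact statement):
        risk <= empirical_risk + sqrt((KL + log(2*sqrt(m)/delta)) / (2*m))
    for a loss bounded in [0, 1]. Maurer's result is usually stated for m >= 8.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if m < 8:
        raise ValueError("this form of the bound is usually stated for m >= 8")
    if kl_divergence < 0.0:
        raise ValueError("KL divergence must be non-negative")
    if not 0.0 <= empirical_risk <= 1.0:
        raise ValueError("empirical risk must lie in [0, 1] for a bounded loss")

    # Complexity term: KL(Q || P) plus the confidence penalty, scaled by 2m.
    complexity = (kl_divergence + math.log(2.0 * math.sqrt(m) / delta)) / (2.0 * m)
    return empirical_risk + math.sqrt(complexity)


# Illustrative values only; they are not taken from the paper.
bound = maurer_pinsker_bound(empirical_risk=0.1, kl_divergence=1000.0, m=50000, delta=0.05)
```

The bound grows with the KL term and shrinks as $m$ increases, which is why PAC-Bayes training trades off training loss against the distance between posterior and prior.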

Figures (10)

  • Figure 1: Our definition of $K$ demonstrates a notable advantage in terms of the achieved numerical value compared to the $K$ in the sub-Gaussian and CGF bounds. Our $K$ varies with $\gamma_1$ and the prior standard deviation, while the previous bounds remain constant as they don't depend on these parameters. The vertical dotted line marks the optimal choice of prior standard deviation and $\gamma_1$ during training, which we determined by testing various pairs of $\gamma_1$ and prior standard deviation for the combination that yields the smallest value of the bound after optimizing over the posterior. The figure shows that near this optimal combination, our $K$ is much smaller than those in the other two bounds. This experiment is conducted on CIFAR10 using CNN9; see details in Sec. \ref{sec:exp}.
  • Figure 2: Comparison of the upper bound $\max\{K_1, K_2\}$ for $K_{\min}$, as derived in Lemma \ref{lm:1}, with the data-driven estimate of $K_{\min}$ obtained via Equation (\ref{eq:kmin}) on CNN9 using the CIFAR10 dataset, with the prior parameterized as a Gaussian distribution centered at the Kaiming initialization.
  • Figure 3: Minimizing PAC-Bayes bounds based on sub-Gaussian, CGF and our proposed bound on CIFAR10 using CNN9. The test error of a randomly initialized model is shown as the initial test error. Minimizing our bound (ours) achieves better test accuracy than minimizing the other two (sub-Gaussian and CGF).
  • Figure 4: Generalization gap (the difference between the training and the testing accuracy) in PAC-Bayes training versus ERM training using ResNet18 on the CIFAR10 dataset. Each point represents an intermediate model during training, plotted according to its test accuracy versus training accuracy. The line $y=x$ indicates the optimal, zero generalization gap. PAC-Bayes training has a smaller generalization gap throughout the training process (Fig. \ref{fig:gap3}) and remains stable despite changes in hyperparameters (Fig. \ref{fig:gap1}). In contrast, ERM training (Fig. \ref{fig:gap2}) is very sensitive to hyperparameter changes. When comparing ERM with our method in Fig. \ref{fig:gap3}, we picked the ERM run in Fig. \ref{fig:gap2} (the blue one) that achieved the best final test accuracy. The discontinuity in that curve around a test accuracy of 87% is due to the activation of the learning rate scheduler.
  • Figure 5: Test accuracy of GCN. The interval spans the first and third quartiles over the ten random splits. $\{$+val$\}$ denotes the performance when both the training and validation datasets are used for training.
  • ...and 5 more figures

Theorems & Definitions (24)

  • Theorem 2.1
  • Definition 4.1: Exponential moment on finite intervals
  • Remark 4.2: The left-side moment
  • Remark 4.3: $\gamma$ within a finite interval bounded away from 0
  • Remark 4.4: Non-negative loss
  • Lemma 4.5: Comparison with the second-order-moment condition
  • Remark 4.6: Comparison with the first-order-moment condition
  • Proof: Proof of Lemma \ref{lm:1}
  • Definition 4.7: Exponential moment over hypotheses
  • Remark 4.8: Dependency of $K$ on prior parameters
  • ...and 14 more