Learning Hyperparameters via a Data-Emphasized Variational Objective

Ethan Harvey, Mikhail Petrov, Michael C. Hughes

Abstract

When training large models on limited data, avoiding overfitting is paramount. Common grid search or smarter search methods rely on expensive separate runs for each candidate hyperparameter, while carving out a validation set that reduces available training data. In this paper, we study gradient-based learning of hyperparameters via the evidence lower bound (ELBO) objective from Bayesian variational methods. This avoids the need for any validation set. We focus on scenarios where the model is over-parameterized for flexibility and the approximate posterior is chosen to be Gaussian with isotropic covariance for tractability, even though it cannot match the true posterior. In such scenarios, we find the ELBO prioritizes posteriors that match the prior, leading to severe underfitting. Instead, we recommend a data-emphasized ELBO that upweights the likelihood but not the prior. In Bayesian transfer learning of image and text classifiers, our method reduces the 88+ hour grid search of past work to under 3 hours while delivering comparable accuracy. We further demonstrate how our approach enables efficient yet accurate approximations of Gaussian processes with learnable lengthscale kernels.
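
The contrast between the two objectives can be sketched numerically. The snippet below is a minimal illustration and not the authors' implementation. It assumes the data-emphasized ELBO takes the form $\kappa \cdot \mathbb{E}_q[\log p(y \mid \theta)] - \mathrm{KL}(q \,\|\, p)$ with $\kappa = D/N$ (the setting recommended in Figure 1), uses isotropic Gaussians for both $q$ and the prior, and plugs in hypothetical expected log-likelihood values. The function names `kl_isotropic_gaussians` and `de_elbo` are illustrative.

```python
import numpy as np


def kl_isotropic_gaussians(mean_q, sigma_q, mean_p, sigma_p):
    """Closed-form KL( N(mean_q, sigma_q^2 I) || N(mean_p, sigma_p^2 I) )."""
    D = mean_q.size
    sq_dist = np.sum((mean_q - mean_p) ** 2)
    return 0.5 * (
        D * sigma_q**2 / sigma_p**2
        + sq_dist / sigma_p**2
        - D
        + 2.0 * D * np.log(sigma_p / sigma_q)
    )


def de_elbo(expected_log_lik, kl, kappa=1.0):
    """Assumed form: kappa * E_q[log-likelihood] - KL(q || prior).

    kappa = 1 recovers the standard ELBO.
    """
    return kappa * expected_log_lik - kl


# Hypothetical setting loosely mirroring Figure 1: D >> N.
D, N = 1_000_000, 1_000
kappa = D / N

prior_mean = np.zeros(D)

# Candidate 1: q equals the prior (zero KL) and predicts at chance on 10 classes.
kl_prior_like = kl_isotropic_gaussians(prior_mean, 1.0, prior_mean, 1.0)
loglik_prior_like = N * np.log(1.0 / 10.0)

# Candidate 2: q has moved away from the prior and fits the data better.
# The expected log-likelihood here is a made-up value for illustration.
fitted_mean = np.full(D, np.sqrt(0.1))
kl_fitted = kl_isotropic_gaussians(fitted_mean, 1.0, prior_mean, 1.0)
loglik_fitted = -300.0

elbo_prefers_prior = (
    de_elbo(loglik_prior_like, kl_prior_like)
    > de_elbo(loglik_fitted, kl_fitted)
)
de_elbo_prefers_fitted = (
    de_elbo(loglik_fitted, kl_fitted, kappa)
    > de_elbo(loglik_prior_like, kl_prior_like, kappa)
)

print("ELBO prefers prior-like q:", elbo_prefers_prior)
print("DE-ELBO prefers fitted q:", de_elbo_prefers_fitted)
```

In this toy setup the KL term scales with $D$ while the likelihood term scales with $N$, so with $D \gg N$ the unweighted objective is dominated by the KL term. Multiplying the likelihood by $\kappa = D/N$ puts the two terms on a comparable scale.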

Paper Structure

This paper contains 36 sections, 45 equations, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: Panels (a)-(b): Comparing approximate posteriors $q$ trained for ELBO (pink) and DE-ELBO (ours, purple). Task: ResNet-50 with $D > 10^6$ trained on CIFAR-10 with $N = 1000$. (a) shows which objective prefers which $q$ across $\eta = \lambda = \tau$. (b) shows which terms in each objective matter most over training steps. Takeaway: When $D \gg N$, the ELBO prefers simpler $q$ (pink) close to the prior, while our DE-ELBO favors $q$ with higher test accuracy (purple). Panel (c): Test accuracy and negative log-likelihood (NLL, lower is better) for $q$ trained via our DE-ELBO with various $\kappa$ values. Task: ConvNeXt-Tiny with $D > 10^6$ trained on CIFAR-10 with $N = 50000$. Takeaway: Set $\kappa = \frac{D}{N}$.
  • Figure 2: Predictions using $\psi,\eta$ selected by different objectives for RFF regression. We show diagEF LA-LML (Immer et al., 2021), diagEF LA-CLML (Lotfi et al., 2022), iso ELBO, iso DE-ELBO (ours), and the true posterior for the $\eta$ that optimizes the LML. We plot train data $y_{1:N}$ with the mean and two standard deviations of the predictive posterior $p(y_* | y_{1:N})$. Takeaway: DE-ELBO best approximates the true posterior's mean and variance near data. Variance far from data is underestimated.
  • Figure 3: Test accuracy over time for L2-SP transfer learning of image and text classifiers. We run each method on 3 separate train sets of size $N$ (3 marker styles). Each panel shows a distinct task: ConvNeXt-Tiny fine-tuned on CIFAR-10, Flower-102, and Pet-37; BERT-base fine-tuned on News-4. We compare MAP + GS, MAP + BO (Hvarfner et al., 2024), diagEF LA-LML (Immer et al., 2021), diagEF LA-CLML (Lotfi et al., 2022), iso ELBO, and iso DE-ELBO. Takeaway: After just a few hours, iso DE-ELBO reaches as good or better performance at small data sizes and similar performance at large sizes, even when other methods are given many additional hours. Further results in App. \ref{app:caseB} examine ConvNeXt-Tiny (Fig. \ref{fig:convnext_tiny_computational_time_comparison}), ViT-B/16 (Fig. \ref{fig:vit_b_16_computational_time_comparison}), ResNet-50 (Fig. \ref{fig:resnet_50_computational_time_comparison}), and BERT-base (Fig. \ref{fig:bert_base_computational_time_comparison}).
  • Figure A.1: Left: A Monte Carlo approximation of the RFF predictive posterior $p(y_* | y_{1:N}) = \int_v p(y_* | v) p(v | y_{1:N}) dv$ by sampling from the true posterior $p(v | y_{1:N})$. Right: The closed-form GP predictive posterior $p(y_* | y_{1:N})$.
  • Figure A.2: Demo of hyperparameter sensitivity and selection for RFF models. The first four columns use the RFF regression model with isotropic Gaussian $q$ in Sec. \ref{sec:caseA_model_definition}, varying estimation and selection techniques. The last column shows the reference fit of a GP's exact posterior, a gold standard for this toy data but less scalable. For regression, we plot the mean and two standard deviations for the predictive posterior $p(y_* | y_{1:N})$. Our DE-ELBO objective best approximates the GP, though it underestimates variance far from data.
  • ...and 7 more figures