Learning via Surrogate PAC-Bayes

Antoine Picard-Weibel, Roman Moscoviz, Benjamin Guedj

TL;DR

We introduce a novel principled strategy for building an iterative learning algorithm by optimising a sequence of surrogate training objectives inherited from PAC-Bayes generalisation bounds.

Abstract

PAC-Bayes learning is a comprehensive setting for (i) studying the generalisation ability of learning algorithms and (ii) deriving new learning algorithms by optimising a generalisation bound. However, optimising generalisation bounds might not always be viable, for tractability or computational reasons, or both. For example, iteratively querying the empirical risk might prove computationally expensive. In response, we introduce a novel principled strategy for building an iterative learning algorithm via the optimisation of a sequence of surrogate training objectives, inherited from PAC-Bayes generalisation bounds. The key argument is to replace the empirical risk (seen as a function of hypotheses) in the generalisation bound by its projection onto a constructible low-dimensional functional space: these projections can be queried much more efficiently than the initial risk. On top of providing that generic recipe for learning via surrogate PAC-Bayes bounds, we (i) contribute theoretical results establishing that iteratively optimising our surrogates implies the optimisation of the original generalisation bounds, (ii) instantiate this strategy in the framework of meta-learning, introducing a meta-objective offering a closed-form expression for the meta-gradient, and (iii) illustrate our approach with numerical experiments inspired by an industrial biochemical problem.

Paper Structure

This paper contains 24 sections, 5 theorems, 23 equations, 8 figures, 2 algorithms.

Key Result

Theorem 1

Under assumptions ($A_1$) to ($A_5$), replacing the empirical risk $R$ by the proxy risk leaves the gradient of the objective $\textup{PB}$ invariant. This result also holds if the approximation space $\mathcal{F}_\theta$ is replaced by $\mathcal{F}_\theta + \mathcal{G} := \{f + g\mid f\in\mathcal{F}_\theta, g\in\mathcal{G}\}$ for any set $\mathcal{G}\subset\text{L}^2(\pi_\theta)$.
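
The following is a hedged sketch of why a projection can leave the gradient unchanged; it is not the paper's proof. It assumes (these are assumptions, not statements from the source) that $\pi_\theta$ has a differentiable density, that differentiation under the integral is allowed, that $\mathcal{F}_\theta$ is a closed linear subspace of $\text{L}^2(\pi_\theta)$ containing every score component $\partial_{\theta_i}\log\pi_\theta$, and that $\textup{PB}$ depends on the risk only through $\pi_\theta[R]$. Here $f_\theta$ denotes the orthogonal projection of $R$ onto $\mathcal{F}_\theta$, held fixed when differentiating.

```latex
% Sketch under the assumptions listed above; f_\theta is the L^2(\pi_\theta)
% projection of R onto \mathcal{F}_\theta and is NOT differentiated.
\begin{align*}
\partial_{\theta_i}\,\pi_\theta[R]
  &= \pi_\theta\!\left[R\,\partial_{\theta_i}\log\pi_\theta\right]
   = \langle R,\ \partial_{\theta_i}\log\pi_\theta\rangle_{\mathrm{L}^2(\pi_\theta)} \\
  &= \langle f_\theta,\ \partial_{\theta_i}\log\pi_\theta\rangle_{\mathrm{L}^2(\pi_\theta)}
   + \underbrace{\langle R - f_\theta,\ \partial_{\theta_i}\log\pi_\theta\rangle_{\mathrm{L}^2(\pi_\theta)}}_{=\,0
     \text{ since } R - f_\theta \,\perp\, \mathcal{F}_\theta} \\
  &= \left.\partial_{\theta'_i}\,\pi_{\theta'}[f_\theta]\right|_{\theta'=\theta}.
\end{align*}
```

Under these assumptions, the residual $R - f_\theta$ is orthogonal to the score, so the risk term of the objective and its proxy share the same gradient at $\theta$; any term of $\textup{PB}$ that does not involve $R$ is unaffected by the replacement.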

Figures (8)

  • Figure 1: Experimental results. The optimisation-performance panel compares the optimisation performance of our algorithm SuPAC-CE with gradient descent approaches on a biochemical calibration task. Optimisation procedures were repeated 20 times; median performance and quantiles 0.2 and 0.8 are represented. The meta-learning panel investigates train and test performance of the meta-learning approach. Mean test performance, as well as quantiles 0.2 and 0.8, for the sequence of built priors is assessed on 40 tasks and compared to the train performance. SuPAC-CE reduced the PAC-Bayes objective to $0.121\pm 0.004$ (avg. risk of posterior of $0.102\pm 0.003$).
  • Figure 2: Overview of SuPAC-CE. At each step, some new predictors are drawn from the current posterior approximation and evaluated (top right figure). All evaluated predictors are then weighted according to the weight of their Voronoi cell (bottom right figure). These weighted evaluations are used to construct an optimal approximation of the score through a linear least-squares task (bottom left figure). The approximated score is used to update the posterior using a closed-form expression (top left figure). This procedure is looped until convergence (center).
  • Figure 3: Preliminary GD optimisation procedures for different choices of hyperparameters. Each optimisation procedure was repeated 20 times; the median performance and 0.2 and 0.8 quantiles are represented. The performance of SuPAC-CE is given for comparison.
  • Figure 4: Comparison of the optimisation procedures as performed by SuPAC-CE and gradient descent (GD) for the two selected sets of hyperparameters. Each optimisation procedure was repeated 20 times; the median performance and 0.2 and 0.8 quantiles are represented. SuPAC-CE was performed with hyperparameters $\alpha_{\mathrm{max}} = 0.5$ and $\mathrm{kl_{max}}=1$.
  • Figure 5: Comparison of the optimisation procedures as performed by SuPAC-CE and Nesterov accelerated gradient descent (x axis: number of empirical risk queries). Each optimisation procedure was repeated 8 times; the median performance and 0.2 and 0.8 quantiles are represented. SuPAC-CE was performed with hyperparameters $\alpha_{\mathrm{max}} = 0.5$ and $\mathrm{kl_{max}}=1$. Momentum values of $0.5$, $0.9$ and $0.95$ were assessed for Nesterov gradient descent. Both the original step size ($\eta$) used in the gradient descent comparisons and twice that step size were investigated. At twice the step size, all momentum-accelerated procedures proved unstable. At the original step size, the momentum tended to increase the stability of the procedure at the cost of speed. All Nesterov accelerated gradient descent procedures assessed were slower than SuPAC-CE.
  • ...and 3 more figures
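
The loop described in the Figure 2 caption (draw predictors, evaluate them, fit a least-squares approximation of the score, update the posterior in closed form) can be illustrated with a minimal sketch. This is not the authors' SuPAC-CE implementation. It assumes a one-dimensional Gaussian posterior family, a Catoni-style objective $\pi[R] + \lambda\,\mathrm{KL}(\pi, \pi_0)$, and a quadratic approximation space; it also draws fresh samples at each step with uniform weights instead of reusing all past evaluations with Voronoi-cell weights. The names `risk`, `surrogate_pac_bayes`, `lam` and `alpha` are hypothetical.

```python
import numpy as np


def risk(x):
    # Hypothetical stand-in for an expensive empirical risk (not from the paper).
    return (x - 2.0) ** 2 + 0.5 * np.sin(3.0 * x)


def surrogate_pac_bayes(risk_fn, m0=0.0, s0=3.0, lam=0.5,
                        n_draws=64, n_steps=30, alpha=0.5, seed=0):
    """Sketch: minimise pi[R] + lam * KL(pi, prior) over 1-D Gaussians by
    repeatedly replacing R with a quadratic least-squares surrogate."""
    rng = np.random.default_rng(seed)
    prior_prec = 1.0 / s0 ** 2          # precision of the Gaussian prior N(m0, s0^2)
    m, s = m0, s0                       # start from the prior
    for _ in range(n_steps):
        # 1. Draw predictors from the current posterior and evaluate the risk.
        x = rng.normal(m, s, size=n_draws)
        y = risk_fn(x)
        # 2. Least-squares fit of the risk on span{1, x, x^2}
        #    (an empirical projection onto a low-dimensional space).
        features = np.stack([np.ones_like(x), x, x ** 2], axis=1)
        (a, b, c), *_ = np.linalg.lstsq(features, y, rcond=None)
        # 3. Closed-form minimiser for the surrogate f(x) = a + b x + c x^2:
        #    the Gibbs measure prior(x) * exp(-f(x) / lam), which is Gaussian.
        prec_new = prior_prec + 2.0 * c / lam
        if prec_new <= 0.0:
            continue                    # surrogate too concave: skip this update
        eta1_new = m0 * prior_prec - b / lam
        # 4. Damped update: convex combination in natural parameters.
        prec = (1.0 - alpha) / s ** 2 + alpha * prec_new
        eta1 = (1.0 - alpha) * m / s ** 2 + alpha * eta1_new
        m, s = eta1 / prec, prec ** -0.5
    return m, s


if __name__ == "__main__":
    mean, std = surrogate_pac_bayes(risk)
    print(f"posterior mean={mean:.3f}, std={std:.3f}")
```

Each step costs `n_draws` risk queries; the posterior update itself only involves the three fitted coefficients. The damping factor `alpha` plays a role loosely analogous to a step-size limit and is an illustrative choice, not the paper's $\alpha_{\mathrm{max}}$ or $\mathrm{kl_{max}}$ mechanism.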

Theorems & Definitions (8)

  • Theorem 1
  • proof
  • Corollary 1
  • Lemma 1
  • Theorem 2
  • Remark 4.1
  • Theorem 3
  • proof