Learning-to-Optimize with PAC-Bayesian Guarantees: Theoretical Considerations and Practical Implementation

Michael Sucker; Jalal Fadili; Peter Ochs

TL;DR

This paper introduces a principled framework to learn optimization algorithms with PAC-Bayesian generalization guarantees, moving beyond worst-case analyses by leveraging data-dependent exponential-family priors and posteriors. The core idea is to bound the true risk of a learned optimizer in terms of its empirical performance plus a KL-divergence term, while allowing a controlled trade-off between convergence speed and convergence guarantees via sublevel probabilities. The authors develop a practical learning procedure that includes imitation-based initialization, a probabilistically constrained sampling scheme, and a Gibbs-posterior update to select hyperparameters, and they validate the approach on quadratics, image processing, Lasso, and neural-network training problems. Results show that the learned optimizers outperform standard baselines by orders of magnitude under the same iteration budgets while providing an interpretable probabilistic guarantee on performance. The work also discusses limitations, notably that the guarantees pertain to the objective value after a fixed number of iterations and that the offline training can be computationally intensive, suggesting avenues for future refinement.
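
As a hedged illustration of how a KL-divergence term typically enters bounds of this kind, the following is the standard change-of-measure inequality of Donsker and Varadhan. It is not the paper's Theorem 1, whose exact statement is not reproduced here. The notation is assumed for this sketch only: $\mathbb{P}$ is a prior and $\mathbb{Q}$ a posterior over hyperparameters, with $\mathbb{Q}$ absolutely continuous with respect to $\mathbb{P}$, and $h$ is any measurable function with $\mathbb{E}_{\mathbb{P}}[e^{h}] < \infty$.

$$
\mathbb{E}_{\mathbb{Q}}[h] \;\le\; \mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{P}) + \log \mathbb{E}_{\mathbb{P}}\big[e^{h}\big]
$$

Equality holds for the Gibbs measure $\frac{d\mathbb{Q}}{d\mathbb{P}} = e^{h} / \mathbb{E}_{\mathbb{P}}[e^{h}]$, which is one reason Gibbs posteriors arise naturally when such a bound is minimized over $\mathbb{Q}$.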

Abstract

We apply PAC-Bayesian theory to the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-Bayesian bounds) and an explicit trade-off between convergence guarantees and convergence speed, which contrasts with the typical worst-case analysis. Our learned optimization algorithms provably outperform related ones derived from a (deterministic) worst-case analysis. The results rely on PAC-Bayesian bounds for general, possibly unbounded loss functions based on exponential families. We then reformulate the learning procedure as a one-dimensional minimization problem and study the possibility of finding a global minimum. Furthermore, we provide a concrete algorithmic realization of the framework and new methodologies for learning-to-optimize, and we conduct four practically relevant experiments to support our theory. With this, we showcase that the provided learning framework yields optimization algorithms that provably outperform the state-of-the-art by orders of magnitude.

Paper Structure (56 sections, 12 theorems, 58 equations, 16 figures, 6 algorithms)

Introduction
Related Work
Broader Context of Learning-to-Optimize
Learning-to-Optimize with Guarantees
Design-Choices in Learning-to-Optimize
PAC-Bayesian Bounds through Change-of-Measure
Boundedness of the Loss Function
Minimization of the PAC-Bound
Choice of the Prior
More Generalization Bounds
Problem Setup & Assumptions
Notation
Main Assumptions and Definitions for Learning Optimization Algorithms
General PAC-Bayesian Theorem
Learning-to-Optimize with Guarantees
...and 41 more sections

Key Result

Theorem 1

Under mild boundedness assumptions on the optimization algorithm, the $\mathbb{Q}$-average population loss $\mathcal{R}_{\sigma}$ of the algorithm's output can be bounded by the $\mathbb{Q}$-average empirical loss $\hat{\mathcal{R}}_{\sigma}$ of the algorithm's

Figures (16)

Figure 1: Some numerical results: Loss over iterations (mean as dashed and median as dotted line) of the learned algorithm compared to a standard choice.
Figure 2: Construction of $\Tilde{\mathbb{P}}_{U}$: On the left, the set $\mathsf{C} \subset \mathscr{U} \times \mathscr{V}$ and two of its sections $\mathsf{C}_{u_1}, \mathsf{C}_{u_2} \subset \mathscr{V}$ are visualized. On the right, the function $\rho(u) = \mathbb{P}_{V}[\mathsf{C}_{u}]$, the interval $[\rho_l, \rho_u]$, and the resulting support $\mathrm{supp}(\Tilde{\mathbb{P}}_{U})$ of $\Tilde{\mathbb{P}}_{U}$ are visualized. Note that, contrary to the visualization here, $\rho$ can actually be highly discontinuous.
Figure 3: Iterative estimation of $\rho(u)$: The black line shows the density $f_{a^{(n)}, b^{(n)}}$ of $\mathrm{Beta}(a^{(n)}, b^{(n)})$ after having observed $I_1, \dots, I_n$. The red dotted line indicates the true probability $\rho(u)$, which we are trying to estimate, and the blue dashed lines indicate the lower and upper quantiles corresponding to $q_l, q_u$. The procedure stops as soon as $Q_{a^{(n)}, b^{(n)}}(q_u) - Q_{a^{(n)}, b^{(n)}}(q_l) < \varepsilon$, which is indicated by the double-headed arrow. Here, we use $q_l = 0.05, q_u = 0.95, \varepsilon = 0.15$ and $\rho(u) = 0.8$.
Figure 4: Example for probabilistically constrained sampling: The upper left plot shows the underlying function $\rho(u)$. It is discontinuous and defines a non-convex set $\mathsf{A}$. The upper right plot shows the probabilistically constrained potential ($\rho(u) \in [0.6, 1]$), from which we want to sample. The lower left plot shows the accepted (black) and the rejected (gray) samples (in a ratio of about 10:1). Further, we can see that some of them are false-positives (dark red) or false-negatives (red). This happens especially for $\rho(u) \approx 0.6$, where the remaining uncertainty can easily lead to a wrong decision. Here, we have chosen $q_l = 0.01$, $q_u = 0.99$, and $\varepsilon = 0.05$ in Algorithm \ref{Alg:Estimation_Probabilistic_Constraint}. Finally, the lower right plot shows the estimated potential.
Figure 5: Learning procedure: 1) Imitation learning. 2) Probabilistically constrained stochastic empirical risk minimization. 3) Construct prior through sampling. 4) Compute posterior by performing the PAC-Bayesian learning step.
...and 11 more figures
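
The stopping rule described in the caption of Figure 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes a uniform $\mathrm{Beta}(1, 1)$ prior, the usual conjugate update (a success increments $a$, a failure increments $b$), and binary indicators $I_n$ supplied by a caller-provided function. The names `beta_cdf`, `beta_quantile`, `estimate_rho`, and `max_draws` are hypothetical. Because $a$ and $b$ stay integers under these assumptions, the Beta CDF is evaluated with the binomial-sum identity, so only the standard library is needed.

```python
import math
import random


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) at x, valid for positive integers a and b.

    Uses the identity I_x(a, b) = sum_{j=a}^{n} C(n, j) x^j (1-x)^(n-j)
    with n = a + b - 1. Intended for moderate n (a few hundred at most).
    """
    n = a + b - 1
    return sum(
        math.comb(n, j) * x**j * (1.0 - x) ** (n - j) for j in range(a, n + 1)
    )


def beta_quantile(q, a, b, tol=1e-9):
    """Quantile of Beta(a, b), found by bisection on the monotone CDF."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def estimate_rho(draw_indicator, q_l=0.05, q_u=0.95, eps=0.15, max_draws=500):
    """Observe indicators until the [q_l, q_u] quantile interval is narrower than eps.

    Returns (lower quantile, upper quantile, number of observations used).
    """
    a, b = 1, 1  # assumption: uniform Beta(1, 1) prior
    for n in range(1, max_draws + 1):
        if draw_indicator():  # I_n = 1
            a += 1
        else:  # I_n = 0
            b += 1
        lower = beta_quantile(q_l, a, b)
        upper = beta_quantile(q_u, a, b)
        if upper - lower < eps:
            return lower, upper, n
    # Cap reached without meeting the tolerance; return the last interval.
    return lower, upper, max_draws
```

For example, `estimate_rho(lambda: random.random() < 0.8)` simulates the setting of Figure 3 with a true probability of 0.8. The `max_draws` cap keeps the binomial sum within floating-point range; much tighter tolerances would call for a library implementation of the Beta quantile function instead.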

Theorems & Definitions (38)

Theorem 1: Informal
Definition 3
Definition 4
Definition 5
Definition 8
Definition 9
Remark 11
Remark 13
Lemma 14
Lemma 15
...and 28 more
