A Markovian Model for Learning-to-Optimize

Michael Sucker, Peter Ochs

TL;DR

This paper presents a probabilistic model for stochastic iterative algorithms, designed with optimization algorithms in mind. The model allows for learning stochastic algorithms based on their empirical performance and yields results about their actual convergence rate and their actual convergence time.

Abstract

We present a probabilistic model for stochastic iterative algorithms with the use case of optimization algorithms in mind. Based on this model, we present PAC-Bayesian generalization bounds for functions that are defined on the trajectory of the learned algorithm, for example, the expected (non-asymptotic) convergence rate and the expected time to reach the stopping criterion. Thus, not only does this model allow for learning stochastic algorithms based on their empirical performance, it also yields results about their actual convergence rate and their actual convergence time. We stress that, since the model is valid in a more general setting than learning-to-optimize, it is of interest for other fields of application, too. Finally, we conduct five practically relevant experiments, showing the validity of our claims.

Paper Structure (35 sections, 19 theorems, 94 equations, 14 figures)

Key Result

Theorem 1

Under mild boundedness assumptions on the algorithm, the $\rho$-average convergence time $\Bar{\tau}$ and the $\rho$-average convergence rate $\Bar{r}$ can be bounded, respectively, by the $\rho$-average empirical convergence time $\hat{\tau}$ plus some remainder term $R_{t, N}$, and the $\rho$-

Figures (14)

  • Figure 1: Superposition of different sources of randomness: The algorithm can be applied to several problem instances coming from a common distribution (upper left). Since this is not under the control of the user, we refer to it as external randomness. Further, the algorithm might be started from different, randomly chosen initializations (upper right). Furthermore, there might be randomness (or uncertainty) in the choice of the hyperparameters of the algorithm (lower left). Finally, the algorithmic update might be inherently stochastic (lower right), which, as it is inherent to the algorithm, we refer to as internal randomness. Combining these four sources of randomness yields the superposition depicted in the middle.
  • Figure 2: Visualization of kernels and their corresponding operations: The top figure visualizes the distributions $\mu(x_i, \cdot)$, $i=0, ..., 3$, (colored dots) for four selected points $x_0, ..., x_3$. Here, each color represents one distribution $\mu(x_i, \cdot)$, $i = 0, ..., 3$, and the colored lines connecting $x_i$ with each point from the next cluster represent all the possible outcomes of $\mu(x_i, \cdot)$. The blackish line connecting the points $x_0, ..., x_3$ (and $x_4$) shows that, by selecting one sample from each $\mu(x_i, \cdot)$, a trajectory emerges from this process. The lower left figure visualizes how a distribution $\nu$ (blue) is transformed by the kernel $\mu$ into the distribution $\nu \cdot \mu$ (purple): At each point $x$ we have a distribution $\mu(x, \cdot)$ (represented by several pink dots connected to one blue dot), and by integrating these points w.r.t. $\nu$, the distribution $\nu \cdot \mu$ emerges. The lower right figure shows the distribution $\nu \otimes \mu$. The creation is the same as for $\nu \cdot \mu$, which is the marginal of $\nu \otimes \mu$. However, $\nu \otimes \mu$ is a measure on $S^2$, while $\nu \cdot \mu$ is a measure on $S$.
  • Figure 3: Visualization of the (joint) transition kernel: The upper row shows how the kernel $\gamma(\alpha, \theta, \cdot)$ acts on the initial distribution: The iterative concatenation (upper left) transforms the initial distribution of ${\xi}^{(0)}$ on $S$ (dark blue) into the distributions of ${\xi}^{(t_1)}$ (light blue), ${\xi}^{(t_2)}$ (purple), ${\xi}^{(t_3)}$ (pink), and ${\xi}^{(t_4)}$ (light pink). Similarly, the iterative product (upper middle) transforms the initial distribution on $S$ into a distribution on $S^5$, namely the joint distribution of $({\xi}^{(0)}, {\xi}^{(t_1)}, ..., {\xi}^{(t_4)})$. Then, this yields the unique distribution $\psi(\alpha, \theta, \cdot)$ on $S^{\mathbb{N}_0}$ (upper right) for the whole trajectory (orange lines). The lower row shows the same thing on the space $S^N$, just that the initial distribution now is given by $\bigotimes_{n=1}^N \mathbb{P}_\mathscr{I}$ and the corresponding kernel is $\Gamma(\alpha, \theta_{[N]}, \cdot)$, which acts on all problem instances $\theta_1, ..., \theta_N$ at once.
  • Figure 4: Visualization of the stopping time $\tau$: In the left plot, the convergence set for each problem is shown as shaded region. As soon as a trajectory enters this region (red crosses), the algorithm is stopped. This yields the distribution of $\tau$ depending on the parameter $\theta$, as shown in the right plot.
  • Figure 5: Quadratic: The top figure shows the loss over the iterations, where HBF is shown in blue and the learned algorithm in pink. The mean and median are shown as dashed and dotted lines, respectively, while the shaded region represents the test data up to the quantile $q = 0.95$, that is, 95% of the test data. We can see from the figure that the learned algorithm reaches the convergence criterion considerably faster than HBF. The lower left plot shows the convergence rate of the learned algorithm. Here, the dashed lines represent the empirical mean and the PAC-bound, respectively, and we can see that the bound is not vacuous, but also not really tight. Similarly, the lower right figure shows the convergence time of the learned algorithm. Again, the dashed lines represent the empirical mean and the corresponding PAC-bound, which, in this case, is reasonably tight.
  • ...and 9 more figures
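
The kernel operations described in the caption of Figure 2 and the stopping time of Figure 4 can be illustrated on a finite state space, where a kernel $\mu$ becomes a row-stochastic matrix. The following is a minimal sketch under that assumption, not the paper's implementation: the three-state space, the matrix `mu`, the initial distribution `nu`, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical finite state space S = {0, 1, 2}.
# Row x of `mu` is the distribution mu(x, .); state 2 is absorbing.
mu = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
nu = np.array([1.0, 0.0, 0.0])  # initial distribution nu on S


def concatenate(nu, mu):
    """nu . mu, a distribution on S: (nu . mu)(y) = sum_x nu(x) mu(x, y)."""
    return nu @ mu


def product(nu, mu):
    """nu (x) mu, a distribution on S^2: (x, y) -> nu(x) mu(x, y)."""
    return nu[:, None] * mu


def sample_trajectory(nu, mu, steps, rng):
    """Draw x_0 ~ nu, then x_{t+1} ~ mu(x_t, .) for `steps` transitions."""
    x = int(rng.choice(len(nu), p=nu))
    path = [x]
    for _ in range(steps):
        x = int(rng.choice(mu.shape[1], p=mu[x]))
        path.append(x)
    return path


def stopping_time(path, target):
    """First index t with path[t] in `target`, or None if it is never hit."""
    for t, x in enumerate(path):
        if x in target:
            return t
    return None


joint = product(nu, mu)         # lives on S^2
marginal = concatenate(nu, mu)  # lives on S; second marginal of `joint`

rng = np.random.default_rng(0)
path = sample_trajectory(nu, mu, steps=20, rng=rng)
tau = stopping_time(path, target={2})  # {2} plays the role of the convergence set
```

Summing `joint` over its first axis recovers `marginal`, which mirrors the statement that $\nu \cdot \mu$ is the marginal of $\nu \otimes \mu$. Repeating `sample_trajectory` and `stopping_time` over many draws would give an empirical distribution of the stopping time, in the spirit of the right plot of Figure 4.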

Theorems & Definitions (41)

  • Theorem 1: Informal
  • Remark 2
  • Definition 3
  • Example 5
  • Lemma 6: Variational Formulation by Donsker and Varadhan
  • Definition 7
  • Theorem 8: Monotone Classes
  • Remark 10
  • Example 12
  • Definition 13
  • ...and 31 more
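
For orientation, the variational formula of Donsker and Varadhan named in Lemma 6 is commonly stated as follows. This is the standard textbook form, given here as an assumption about what the lemma covers; the paper's own statement, assumptions, and notation may differ, and the symbols $P$, $Q$, and $f$ are chosen here only for illustration. For a probability measure $P$ and a measurable function $f$ with $\mathbb{E}_P[e^f] < \infty$,

$$
\log \mathbb{E}_{P}\big[e^{f}\big] = \sup_{Q \ll P} \Big\{ \mathbb{E}_{Q}[f] - \mathrm{KL}(Q \,\|\, P) \Big\},
$$

where the supremum runs over probability measures $Q$ that are absolutely continuous with respect to $P$. In PAC-Bayesian arguments this identity is typically used as a change of measure: rearranging gives $\mathbb{E}_{Q}[f] \le \mathrm{KL}(Q \,\|\, P) + \log \mathbb{E}_{P}[e^{f}]$ for every such $Q$, so an expectation under a data-dependent posterior $Q$ can be controlled through a fixed prior $P$.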