
Importance Sampling Optimization with Laplace Principle

Radu-Alexandru Dragomir, François Portier, Victor Priser

Abstract

Grid search and random search are widely used techniques for hyperparameter tuning in machine learning, especially when gradient information is unavailable. In these methods, a finite set of candidate configurations is evaluated, and the best-performing one is selected. We propose a simple and computationally inexpensive refinement of this paradigm: instead of selecting a single best point, we form a weighted average of the evaluated configurations, where the weights are chosen using an importance sampling scheme inspired by the Laplace principle. This scheme can be implemented as a post-processing step on top of a random search, with no additional function evaluations. We also propose an iterative variant, where the sampling distributions are chosen adaptively to generate new candidate points around the previous estimate, in the spirit of Evolution Strategy (ES) methods. In a general non-convex setting, we show that, after n evaluations, the error of the proposed methods is of smaller order than n -2/(d+2) . This compares favorably to random search or grid search rates of n -1/d as soon as d > 2. We illustrate the practical benefits of this averaging strategy on several examples.
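The static post-processing step described above amounts to a softmax-style weighted average of the evaluated points. Below is a minimal NumPy sketch, assuming self-normalized importance-sampling weights proportional to $e^{-\alpha f(x_i)}/q_0(x_i)$; the function name, the toy objective, the uniform proposal, and the value $\alpha = 50$ are illustrative choices, not taken from the paper.

```python
import numpy as np

def laplace_weighted_average(points, values, alpha, log_q0=None):
    """Post-process random-search evaluations into a single estimate.

    Hypothetical sketch: `points` holds the n sampled configurations
    (n x d array), `values` their objective values f(x_i), `alpha` the
    inverse temperature of the Laplace/Gibbs weighting, and `log_q0`
    (optional) the log-density of the proposal at each point, used as
    the importance-sampling correction when sampling is not uniform.
    """
    log_w = -alpha * np.asarray(values, dtype=float)
    if log_q0 is not None:
        log_w = log_w - np.asarray(log_q0, dtype=float)  # IS correction
    log_w -= log_w.max()          # stabilize the exponentials
    w = np.exp(log_w)
    w /= w.sum()                  # self-normalized weights
    return w @ np.asarray(points, dtype=float)

# Usage: average the points of a plain random search instead of
# returning the single best one; no extra function evaluations needed.
rng = np.random.default_rng(0)
f = lambda x: np.sum(x**2, axis=-1)        # toy objective
X = rng.uniform(-1.0, 1.0, size=(500, 4))  # uniform proposal on [-1,1]^4
x_hat = laplace_weighted_average(X, f(X), alpha=50.0)
```

The adaptive variant described in the abstract would, between batches, re-center the proposal around the current estimate in the spirit of Evolution Strategies, while the weighting step itself stays the same.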

Paper Structure

This paper contains 29 sections, 10 theorems, 82 equations, 7 figures, and 2 algorithms.

Key Result

Theorem 1

Let $f$ and $q_0$ satisfy Assumption (hyp:f) as well as certain integrability conditions at infinity. Then, for $\alpha$ and $n$ large enough, the output $x_n$ of Algorithm (algo:IS) after $n$ iterations satisfies an error bound of smaller order than $n^{-2/(d+2)}$, where the constants $C$ and $C'$ appearing in the bound depend on $f$ and $q_0$.
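As a quick sanity check on the rate comparison claimed in the abstract (this derivation is ours, using only the two quoted rates):

$$n^{-2/(d+2)} = o\!\left(n^{-1/d}\right) \iff \frac{2}{d+2} > \frac{1}{d} \iff 2d > d + 2 \iff d > 2.$$

For $d = 4$, for instance, the averaged estimate converges at rate $n^{-1/3}$, against $n^{-1/4}$ for plain random or grid search.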

Figures (7)

  • Figure 1: The plots on the top refer to static non-adaptive methods, where the initial distribution is close to $x^\ast$ and the toy functions $f_1, f_2, f_3$ are tested in order. The plots on the bottom refer to adaptive methods, where the initial distribution $q_0$ is far from $x^\ast$. The dimension used is $d=4$, and we refer to the other paragraphs in this section for additional information on the experimental parameters. Moreover, Section sec:AdditionalPlot provides the plots at a larger scale, with additional choices of the dimension $d$.
  • Figure 2: Function Square ($f_1$) in the static case.
  • Figure 3: Function Rastrigin ($f_2$) in the static case.
  • Figure 4: Function Ackley ($f_3$) in the static case.
  • Figure 5: Function Square ($f_1$) in the adaptive case.
  • ...and 2 more figures

Theorems & Definitions (12)

  • Theorem 1: LISO, informal
  • Lemma 1: Laplace principle (Kirwin, Higher Asymptotics of Laplace's Approximation, Theorem 1.1)
  • Theorem 2
  • Corollary 1
  • Proposition 1
  • Theorem 3: general bound for $\pi_{n,\alpha} ^{(A)} (\varphi)$
  • Corollary 2
  • Lemma 2
  • Lemma 3
  • Proof
  • ...and 2 more