A Robbins--Monro Sequence That Can Exploit Prior Information For Faster Convergence

Siwei Liu, Ke Ma, Stephan M. Goetz

TL;DR

This work introduces a prior-information Robbins–Monro (PI–RM) sequence that integrates a target-point prior, $P_{x_t}(x)$, into the stochastic root-finding framework to accelerate convergence without requiring a regression model. Each PI–RM update combines the prior with a shrinking RM distribution: $x_{i+1}=\text{argmax}_x \big( P_{x_t}(x)\, \mathcal{N}(x \,\big|\, x_i-s_i(y_i-y_t), c_i^2) \big)$ with $c_i=c_0/i$, yielding faster early progress while preserving almost sure convergence for broad priors, including Gaussian, Gaussian mixtures, and KDE-derived priors. The authors provide convergence proofs for linear and nonlinear $f$ under Gaussian priors and extend results to practically arbitrary priors via weighted Gaussian sums, KDEs, and regularity assumptions, complemented by a thorough numerical study. The findings show notable early-term speedups, especially under high observation noise, and they offer a practical guideline for selecting the initial prior spread $c_0$. The approach broadens stochastic approximation by embedding prior information into RM iterations, with potential impact on fast root finding under limited measurements and noisy evaluations.

Abstract

We propose a new method to improve the convergence speed of the Robbins-Monro algorithm by introducing prior information about the target point into the Robbins-Monro iteration. We incorporate the prior information without the need for a (potentially wrong) regression model, which would also entail additional constraints. We show that this prior-information Robbins-Monro sequence is convergent for a wide range of prior distributions, even wrong ones, such as a Gaussian, a weighted sum of Gaussians (e.g., a kernel density estimate), as well as arbitrary bounded distribution functions greater than zero. We furthermore analyse the sequence numerically to understand its performance and the influence of its parameters. The results demonstrate that the prior-information Robbins-Monro sequence converges faster than the standard one, especially during the first steps, which are particularly important for applications where the number of function measurements is limited, and when the noise in observing the underlying function is large. We finally propose a rule to select the parameters of the sequence.
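
As a minimal sketch of the update described in the TL;DR, assume the Robbins-Monro distribution is a Gaussian density centred at $x_i - s_i(y_i - y_t)$ with standard deviation $c_i = c_0/i$. For a Gaussian prior, the argmax of the product is then the precision-weighted mean of the two centres; for other priors it can be approximated on a grid. The function names, the gain schedule $s_i = s_0/i$, and all numerical values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def pi_rm_step_gaussian(x_i, y_i, y_t, s_i, c_i, mu_p, sigma_p):
    """One PI-RM update for a Gaussian prior N(mu_p, sigma_p^2).

    The product of two Gaussian densities peaks at the
    precision-weighted mean of their centres.
    """
    m = x_i - s_i * (y_i - y_t)   # classical Robbins-Monro proposal
    w_rm = 1.0 / c_i**2           # precision of the (assumed Gaussian) RM term
    w_p = 1.0 / sigma_p**2        # precision of the prior
    return (w_rm * m + w_p * mu_p) / (w_rm + w_p)


def pi_rm_step_grid(x_i, y_i, y_t, s_i, c_i, log_prior, grid):
    """One PI-RM update for an arbitrary prior, via argmax on a grid.

    `log_prior` is a hypothetical callable returning the log prior
    density (up to an additive constant) for an array of points.
    """
    m = x_i - s_i * (y_i - y_t)
    log_post = log_prior(grid) - 0.5 * ((grid - m) / c_i) ** 2
    return grid[np.argmax(log_post)]


def run_pi_rm(f, y_t, x0, n_steps, s0, c0, mu_p, sigma_p, noise_sd, rng):
    """Iterate the Gaussian-prior update with s_i = s0/i (assumed), c_i = c0/i."""
    x = x0
    for i in range(1, n_steps + 1):
        y = f(x) + noise_sd * rng.normal()   # noisy observation of f at x
        x = pi_rm_step_gaussian(x, y, y_t, s0 / i, c0 / i, mu_p, sigma_p)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy problem: f(x) = x with target value y_t = 1, so the root is x_t = 1.
    x_final = run_pi_rm(lambda x: x, 1.0, 0.0, 2000, 1.0, 0.3, 1.2, 0.5, 0.1, rng)
    print("estimate after 2000 steps:", x_final)
```

Because $c_i$ shrinks like $1/i$, the weight of the RM term grows like $i^2$ relative to the fixed prior weight, so in this sketch the prior mainly influences the first few steps and the iteration behaves like the classical sequence later on.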

Paper Structure

This paper contains 14 sections, 8 theorems, 92 equations, 5 figures.

Key Result

Lemma 3.3

When $f(x)=ax$ is linear ($a>0$), for any finite $h \in \mathbb{Z}^{+},$

Figures (5)

  • Figure 1: Illustration of the prior-information Robbins--Monro sequence with the two contributions of prior distribution and Robbins--Monro distribution combined to provide the a-posteriori distribution.
  • Figure 2: The median deviation ($|x_i-x_\textrm{t}|$ or $|x_i^{\mathrm{s}}-x_\textrm{t}|$) of 400,000 runs for four algorithms in the first 20 steps ($c_0 = 0.3$, $d = 1$).
  • Figure 3: The median deviation ($|x_i-x_\textrm{t}|$ or $|x_i^{\mathrm{s}}-x_\textrm{t}|$) of 400,000 runs for the four algorithms at the $10^{\mathrm{th}}$ step when (A) $d=0.15$, (B) $d=0.3$, (C) $d=0.48$, and (D) at the $20^{\mathrm{th}}$ step when $d=0.3$.
  • Figure 4: The optimal $c_0$ of the prior distribution for $d \in \{0.25, 0.75, 1.25, 1.75\}$ and iterations 6--100. The optimal $c_0$ for each $d$ and iteration is the average of 700 medians, each derived from 6,000 runs (i.e., the whole algorithm is run $700 \times 6{,}000$ times).
  • Figure 5: The performance of the prior-information Robbins--Monro sequence with the optimal $c_0$ for $d \in \{0.25, 0.75, 1.25, 1.75\}$ and iterations 6--100. The accuracy gain for each $d$ and iteration is the average of 700 medians, each derived from 6,000 runs (i.e., the whole algorithm is run $700 \times 6{,}000$ times).

Theorems & Definitions (21)

  • Definition 3.1
  • Definition 3.2
  • Lemma 3.3
  • proof
  • Lemma 3.4
  • proof
  • Theorem 3.5
  • proof
  • Remark
  • Theorem 4.1
  • ...and 11 more