Stochastic Optimization with Optimal Importance Sampling
Liviu Aolaritei, Bart P. G. Van Parys, Henry Lam, Michael I. Jordan
TL;DR
The paper addresses stochastic optimization in which the importance sampling (IS) distribution used for gradient estimation must be calibrated together with the decision variable, creating a circular dependency. It proposes a single-loop stochastic approximation method that updates the decision variable and the IS parameters together via a joint variant of Nesterov's dual averaging, without time-scale separation or nested loops, and proves global convergence and asymptotic variance optimality within the IS class. The theoretical results establish almost sure convergence, finite-time identification of the active constraints, and central limit theorems (CLTs) for the coupled iterates; the averaged decision iterates attain the minimal asymptotic variance, as if an oracle IS distribution were available. A numerical example on rare-event quantile estimation shows substantial variance reduction in practice.
Abstract
Importance Sampling (IS) is a widely used variance reduction technique for enhancing the efficiency of Monte Carlo methods, particularly in rare-event simulation and related applications. Despite its effectiveness, the performance of IS is highly sensitive to the choice of the proposal distribution and often requires stochastic calibration. While the design and analysis of IS have been extensively studied in estimation settings, applying IS within stochastic optimization introduces a fundamental challenge: the decision variable and the importance sampling distribution are mutually dependent, creating a circular optimization structure. This interdependence complicates both convergence analysis and variance control. We consider convex stochastic optimization problems with linear constraints and propose a single-loop stochastic approximation algorithm, based on a joint variant of Nesterov's dual averaging, that updates the decision variable and the importance sampling distribution together, without time-scale separation or nested optimization. The method is globally convergent and achieves minimal asymptotic variance among stochastic gradient schemes, matching the performance of an oracle sampler adapted to the optimal solution.
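
To make concrete why the choice of proposal distribution matters in rare-event settings, the following is a minimal Python sketch of plain importance sampling for a tail probability. It is not the paper's algorithm: it performs no optimization and uses a fixed proposal. The setup is an assumption chosen for illustration: a standard normal target, a mean-shifted normal proposal, and the hypothetical names `estimate_tail`, `threshold`, and `shift`.

```python
import math
import random


def estimate_tail(threshold, shift, n, rng):
    """Estimate P(X > threshold) for X ~ N(0, 1) by sampling from N(shift, 1).

    Returns (estimate, per-sample variance). shift = 0 is plain Monte Carlo.
    """
    values = []
    for _ in range(n):
        x = rng.gauss(shift, 1.0)
        # Likelihood ratio of the target N(0, 1) against the proposal N(shift, 1).
        weight = math.exp(-shift * x + 0.5 * shift * shift)
        values.append(weight if x > threshold else 0.0)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var


rng = random.Random(0)
threshold = 4.0

# Exact tail probability and exact per-sample variance of the plain indicator.
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))
plain_var_exact = exact * (1.0 - exact)

# Plain Monte Carlo rarely sees the event at this sample size.
naive_est, _ = estimate_tail(threshold, 0.0, 20000, rng)

# Shifting the proposal mean to the threshold makes the event common.
is_est, is_var = estimate_tail(threshold, threshold, 20000, rng)

print(f"exact={exact:.3e}  naive={naive_est:.3e}  IS={is_est:.3e}")
print(f"per-sample variance: plain={plain_var_exact:.3e}  IS={is_var:.3e}")
```

In this sketch a good value of `shift` depends on `threshold`. In the optimization setting described above, the analogous quantity depends on the decision variable, which is itself being updated; this is the source of the circular dependency.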
