Insight from the Kullback-Leibler divergence into adaptive importance sampling schemes for rare event analysis in high dimension
Jason Beh, Yonatan Shadmi, Florian Simatos
TL;DR
The paper investigates adaptive importance sampling (AIS) for rare-event probabilities in high dimension, focusing on two families of schemes: the cross-entropy (CE) method and projection-based auxiliary densities $g_{proj}$ (and their estimators). It proves that if the adaptation sample size $n_g$ grows polynomially with the dimension $d$ and the rare-event probability $p_f(A)$ is bounded away from zero, then the AIS estimators are efficient in high dimension and weight degeneracy is avoided, contrary to common belief. For projection methods, efficiency is achieved as soon as $n_g \gg rd$, with $r$ the dimension of the projection subspace, which highlights the advantage of low-dimensional projections in high-dimensional settings; in particular, taking $r = d$ recovers the optimal Gaussian $g_A$, whose estimator $\hat g_A$ requires $n_g \gg d^2$. For CE, a polynomial growth rate is shown to suffice; the rate itself remains implicit, but it depends on the smallest eigenvalue of the covariance estimates. Projection methods, in contrast, are studied in a simple computational framework that makes the required growth rate explicit. Overall, the work provides KL-divergence-based conditions and CD-type tail bounds that explain when AIS can beat the curse of dimensionality in rare-event analysis, and it offers insight into the trade-offs between projection dimension, adaptation sample size, and estimator accuracy.
Abstract
We study two adaptive importance sampling schemes for estimating the probability of a rare event in the high-dimensional regime $d \to \infty$ with $d$ the dimension. The first scheme is the prominent cross-entropy (CE) method, and the second scheme, motivated by recent results, uses as auxiliary distribution a projection of the optimal auxiliary distribution on a lower-dimensional subspace. In these schemes, two samples are used: the first one to learn the auxiliary distribution and the second one, drawn according to the learned distribution, to perform the final probability estimation. Contrary to the common belief that the sample size needs to grow exponentially in the dimension to make the estimator consistent and avoid the weight degeneracy phenomenon, we find that a polynomial sample size in the first learning step is enough. We prove this result assuming that the sought probability is bounded away from 0. For CE, insight is provided on the polynomial growth rate which remains implicit. In contrast, we study the second scheme in a simple computational framework assuming that samples from the conditional distribution are available. This makes it possible to show that the sample size only needs to grow like $rd$ with $r$ the effective dimension of the projection, which highlights the potential benefits of these projection methods.
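The two-sample structure described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the initial density $f$ is taken to be the standard Gaussian in dimension $d$, the event is the toy set $A = \{x : |x_1| \ge 1\}$, samples from the conditional distribution are obtained by plain rejection (only viable because $p_f(A)$ is not small here), and the auxiliary density is a Gaussian fitted with the full empirical mean and covariance (the case $r = d$), without any projection. The names `two_stage_is`, `log_gauss`, `in_event` and `n_p` are hypothetical.

```python
import math
import numpy as np


def log_gauss(x, mean, cov):
    """Log-density of N(mean, cov) evaluated at each row of x."""
    d = x.shape[1]
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, (x - mean).T)  # whitened residuals, shape (d, n)
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * math.log(2.0 * math.pi) + logdet + np.sum(z**2, axis=0))


def two_stage_is(in_event, d, n_g, n_p, rng):
    """Two-stage importance sampling estimate of P_f(A), with f = N(0, I_d)."""
    # Stage 1: collect n_g samples from f conditioned on A (rejection sampling,
    # an assumption of this sketch) and fit a Gaussian to them.
    accepted, count = [], 0
    while count < n_g:
        x = rng.standard_normal((4 * n_g, d))
        keep = x[in_event(x)]
        accepted.append(keep)
        count += len(keep)
    cond = np.concatenate(accepted)[:n_g]
    mean_hat = cond.mean(axis=0)
    cov_hat = np.cov(cond, rowvar=False)  # needs n_g > d to be invertible

    # Stage 2: draw n_p samples from the fitted Gaussian and average the
    # likelihood ratios f/g over the samples that fall in A.
    chol = np.linalg.cholesky(cov_hat)
    y = mean_hat + rng.standard_normal((n_p, d)) @ chol.T
    log_w = log_gauss(y, np.zeros(d), np.eye(d)) - log_gauss(y, mean_hat, cov_hat)
    return float(np.mean(np.exp(log_w) * in_event(y)))


def in_event(x):
    """Toy event A = {|x_1| >= 1}; returns a boolean mask over the rows of x."""
    return np.abs(x[:, 0]) >= 1.0


rng = np.random.default_rng(0)
p_hat = two_stage_is(in_event, d=10, n_g=2000, n_p=5000, rng=rng)
p_true = math.erfc(1.0 / math.sqrt(2.0))  # exact P(|Z| >= 1) for Z ~ N(0, 1)
print(f"estimate = {p_hat:.4f}, exact = {p_true:.4f}")
```

In this toy setting the first-stage size `n_g = 2000` is well above $d^2 = 100$, in line with the $n_g \gg d^2$ regime mentioned for the full Gaussian fit; shrinking `n_g` towards `d` makes the covariance estimate ill-conditioned and the weights erratic.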
