Reducing Estimation Uncertainty Using Normalizing Flows and Stratification

Paweł Lorek, Rafał Topolnicki, Tomasz Trzciński, Maciej Zięba, Aleksandra Krystecka

TL;DR

The paper tackles estimating $I=\mathbb{E}[f(\mathbf{X})]$ when the distribution of $\mathbf{X}$ is unknown and only samples are available. It introduces a flow-based density model that maps a Gaussian base to complex data distributions and leverages stratified sampling in the latent space to reduce estimation variance. Two stratification schemes, Cartesian (M1) and spherical (M2), plus high-dimensional approximations (M_rad, High3, Rand3), are developed, with an optimal-allocation scheme to further minimize variance. Empirical results on synthetic and real data, including high-dimensional cases up to $d=128$, show substantial improvements over Crude Monte Carlo and Gaussian mixture models, with training requirements on the order of hundreds to a few thousand samples; the authors provide reproducible code. This work offers a practical and scalable approach to variance reduction in complex, unknown distributions and demonstrates its applicability to high-dimensional estimation tasks.
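
The idea of stratifying in the latent space can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_flow` is a hypothetical stand-in for a trained flow $\mathcal{F}$ (here a fixed linear map), and all function names and parameter values are assumptions chosen for illustration. The sketch splits a 2D Gaussian latent space into $k \times k$ equiprobable Cartesian cells, draws the same number of points in each cell (proportional allocation), and compares the empirical variance of the resulting estimator with that of Crude Monte Carlo for $I=\mathbb{P}(X_1>1.2, X_2>1.2)$.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used only for its inverse CDF


def toy_flow(z1, z2, rho=0.5):
    """Hypothetical stand-in for a trained flow F (latent Gaussian -> data space)."""
    return z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2


def f(x1, x2):
    """Indicator of the event {X1 > 1.2, X2 > 1.2}."""
    return 1.0 if (x1 > 1.2 and x2 > 1.2) else 0.0


def crude_mc(R, rng):
    """Crude Monte Carlo: average f(F(Z)) over R iid latent draws."""
    total = 0.0
    for _ in range(R):
        total += f(*toy_flow(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
    return total / R


def latent_in_cell(i, k, rng):
    """Draw a standard normal conditioned on the i-th of k equiprobable intervals."""
    u = (i + rng.random()) / k
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return STD_NORMAL.inv_cdf(u)


def stratified_mc(R, k, rng):
    """Cartesian strata in latent space with proportional allocation."""
    per_cell = R // (k * k)  # equal cell probabilities => equal counts per cell
    if per_cell == 0:
        raise ValueError("R must be at least k*k")
    total = 0.0
    for i in range(k):
        for j in range(k):
            for _ in range(per_cell):
                z1 = latent_in_cell(i, k, rng)
                z2 = latent_in_cell(j, k, rng)
                total += f(*toy_flow(z1, z2))
    return total / (per_cell * k * k)


def sample_variance(values):
    """Unbiased sample variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def compare(reps=50, R=1024, k=16, seed=0):
    """Empirical variance of both estimators over `reps` independent runs."""
    rng = random.Random(seed)
    cmc = [crude_mc(R, rng) for _ in range(reps)]
    strat = [stratified_mc(R, k, rng) for _ in range(reps)]
    return sample_variance(cmc), sample_variance(strat)


if __name__ == "__main__":
    var_cmc, var_strat = compare()
    print("CMC variance:       ", var_cmc)
    print("Stratified variance:", var_strat)
```

In this toy setting only the cells cut by the boundary of the event contribute to the stratified variance, which is why it falls below the Crude Monte Carlo variance. With a real trained flow, `toy_flow` would be replaced by the model's latent-to-data map.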

Abstract

Estimating the expectation of a real-valued function of a random variable from sample data is a critical aspect of statistical analysis, with far-reaching implications in various applications. Current methodologies typically assume (semi-)parametric distributions such as Gaussian or mixed Gaussian, leading to significant estimation uncertainty if these assumptions do not hold. We propose a flow-based model, integrated with stratified sampling, that leverages a parametrized neural network to offer greater flexibility in modeling unknown data distributions, thereby mitigating this limitation. Our model shows a marked reduction in estimation uncertainty across multiple datasets, including high-dimensional (30 and 128) ones, outperforming crude Monte Carlo estimators and Gaussian mixture models. Reproducible code is available at https://github.com/rnoxy/flowstrat.

Paper Structure

This paper contains 37 sections, 2 theorems, 43 equations, 10 figures, 19 tables, 1 algorithm.

Key Result

Lemma

The variance of the CMC estimator admits the following decomposition, which involves the variance of the proportional-allocation estimator:
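
For orientation, a decomposition of this kind has the following standard form. The notation is assumed here and may differ from the paper's: strata $A_1,\dots,A_m$ with probabilities $p_j=\mathbb{P}(\mathbf{X}\in A_j)$, stratum means $I_j=\mathbb{E}[f(\mathbf{X})\mid \mathbf{X}\in A_j]$, a total of $R$ samples, and $R_j=p_jR$ samples drawn in stratum $j$ (taken to be integers).

$$
\mathrm{Var}\big(\hat{Y}_R^{\rm CMC}\big) \;=\; \mathrm{Var}\big(\hat{Y}_R^{\rm prop}\big) \;+\; \frac{1}{R}\sum_{j=1}^{m} p_j\,(I_j - I)^2 .
$$

The second term is non-negative, so under these assumptions proportional allocation never has larger variance than CMC, and the gain grows with the spread of the stratum means $I_j$ around $I$.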

Figures (10)

  • Figure 1: Results for Example 1: 100 estimates of $I={\mathbb{P}}\,(X_1>1.2, X_2>1.2)$, each obtained from $R=2^{12}$ simulations and shown as a vertical line depicting its 95$\%$ confidence interval. Orange lines: intervals containing the true $I$ (red line); blue lines: intervals not containing $I$. Green line: the estimate of $I$ from the observations. Here ${\mathcal{F}}$ means that the samples ${\mathbf{x}}_i$ were drawn from the trained flow model, CMC stands for Crude Monte Carlo (i.e., the mean of $f({\mathbf{x}}_i)$, no stratification), and ($\rm{opt}$,M1) denotes a specific stratification.
  • Figure 2: Comparison of Cartesian and Spherical stratifications.
  • Figure 3: Example 1: $R=2^{13}$ points and $m=16$ strata. Cartesian (left column) and spherical (right column). Smaller plots: 2D iid standard normal; larger plots: points mapped through ${\mathcal{F}}$, colors denote corresponding strata.
  • Figure 4: Estimates vs. training sample size $n_{\text{train}}$.
  • Figure A1: Example A1: 100 estimates of $I={\mathbb{P}}\,(X_1>1.2, X_2>1.2)$, each from $R=2^{12}$ simulations, with 95$\%$ confidence intervals. Red line: true $I$; green line: $\hat{Y}_n^{\rm obs}$; orange lines: intervals containing $I$; blue lines: intervals not containing $I$. $A$: percentage of intervals not containing $I$; $B$: average confidence interval length.
  • ...and 5 more figures

Theorems & Definitions (4)

  • lemma
  • proof
  • theorem A1
  • proof