Robust Simulation-Based Inference under Missing Data via Neural Processes

Yogesh Verma, Ayush Bharti, Vikas Garg

TL;DR

This work tackles missing data in simulation-based inference (SBI) and shows that naive imputation biases the SBI posterior. It introduces RISE, which jointly learns an imputation model based on Neural Processes and a neural posterior estimator within an amortized framework, enabling robust inference under MAR, MNAR, and MCAR conditions. Empirical results across SBI benchmarks (Ricker, OUP, GLM, GLU) and real bioactivity datasets (Adrenergic and Kinase assays) demonstrate improved posterior accuracy, reliable calibration, and strong imputation performance, including a meta-learning variant (RISE-Meta) that generalizes to unseen missingness levels. These results highlight RISE's practical impact for SBI in real-world scenarios where data are frequently incomplete or corrupted.

Abstract

Simulation-based inference (SBI) methods typically require fully observed data to infer parameters of models with intractable likelihood functions. However, datasets often contain missing values due to incomplete observations, data corruption (common in astrophysics), or instrument limitations (e.g., in high-energy physics applications). In such scenarios, missing data must be imputed before applying any SBI method. We formalize the problem of missing data in SBI and demonstrate that naive imputation methods can introduce bias in the estimation of the SBI posterior. We also introduce a novel amortized method that addresses this issue by jointly learning the imputation model and the inference network within a neural posterior estimation (NPE) framework. Extensive empirical results on SBI benchmarks show that our approach provides robust inference outcomes compared to standard baselines for varying levels of missing data. Moreover, we demonstrate the merits of our imputation model on two real-world bioactivity datasets (Adrenergic and Kinase assays). Code is available at https://github.com/Aalto-QuML/RISE.
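The claim that naive imputation can bias inference can be illustrated with a toy sketch. This is not the RISE method and not an SBI pipeline. It assumes a one-parameter Gaussian "simulator", uses the sample mean as a stand-in for a posterior summary, and picks one simple self-masking rule for MNAR. The names `mcar_mask`, `mnar_mask`, and `impute` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def mcar_mask(x, eps, rng):
    """MCAR: every entry is missing independently with probability eps."""
    return rng.random(x.shape) < eps

def mnar_mask(x, eps, rng):
    """MNAR (assumed self-masking rule): larger values are more likely to be missing."""
    # Logistic curve in the standardized value, rescaled so that the
    # average missingness probability is eps.
    p = 1.0 / (1.0 + np.exp(-(x - x.mean()) / x.std()))
    p = np.clip(p * eps / p.mean(), 0.0, 1.0)
    return rng.random(x.shape) < p

def impute(x, mask, how):
    """Naive imputation: fill missing entries with zero or with the observed mean."""
    out = x.copy()
    out[mask] = 0.0 if how == "zero" else x[~mask].mean()
    return out

rng = np.random.default_rng(0)
theta_true = 2.0
# Toy "simulator" (an assumption for illustration): x ~ N(theta, 1).
x = rng.normal(theta_true, 1.0, size=10_000)

results = {}
for name, make_mask in [("MCAR", mcar_mask), ("MNAR", mnar_mask)]:
    mask = make_mask(x, 0.25, rng)
    for how in ["zero", "mean"]:
        # The sample mean of the imputed data stands in for the estimate of theta.
        results[(name, how)] = impute(x, mask, how).mean()
        print(f"{name:4s} {how:4s} imputation: estimate = {results[(name, how)]:.3f}")
```

In this toy setting, zero imputation pulls the estimate toward zero under both mechanisms, while mean imputation stays close to the true value under MCAR but underestimates it under this MNAR rule, because the observed values are no longer representative of the missing ones.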

Paper Structure

This paper contains 49 sections, 2 theorems, 27 equations, 9 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

If $\hat{p}(\mathbf{x}_{\text{mis}} \, | \, \mathbf{x}_{\text{obs}})$ is misaligned with $p_{\text{true}}(\mathbf{x}_{\text{mis}} \, | \, \mathbf{x}_{\text{obs}})$, then the estimated SBI posterior $\hat{p}_{\text{SBI}}(\theta \, | \, \mathbf{x}_{\text{obs}})$ will be biased (in general), i.e.

Figures (9)

  • Figure 1: Effect of missing data on SBI. NPE posteriors for the two-parameter Ricker model (Wood, 2010), where the method of Wang et al. (2024) (with zero augmentation) is used to handle $\varepsilon\%$ of values missing in the data. As $\varepsilon$ increases, the NPE posteriors become biased and drift away from the true parameter value, denoted by the black lines.
  • Figure 2: Plate diagram
  • Figure 3: Posterior estimates for the Hodgkin-Huxley model under MCAR (top row) and MNAR (bottom row) with varying proportions of missing values in the data (denoted by $\varepsilon$). The posteriors obtained from RISE stay close to the true parameter (denoted by the black lines) for all values of $\varepsilon$, while those from the baseline methods move further away as $\varepsilon$ increases.
  • Figure 4: Generalizing over missingness.
  • Figure 5: Imputation RMSE for MCAR (top) and MNAR (bottom) over various synthetic datasets. Here, GL refers to a 10-dimensional Gaussian linear model; see Lueckmann et al. (2021) for details.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Definition 1: SBI posterior under true imputation
  • Definition 2: SBI posterior under estimated imputation
  • Proposition 1
  • Proposition 2: Training objective
  • proof
  • proof