Model-Free Assessment of Simulator Fidelity via Quantile Curves

Garud Iyengar, Yu-Shiou Willy Lin, Kaizheng Wang

TL;DR

This work quantifies the sim-to-real discrepancy of complex, often ML-based simulators by proposing a model-free method that estimates, from finite data, the full quantile function of the discrepancy between simulated and ground-truth outcomes. The authors develop a two-step procedure: first they construct per-scenario confidence sets and a corresponding pseudo-discrepancy, then they form a calibrated quantile curve $\hat{V}_m$ with finite-sample guarantees. The curve enables distribution-level inference and robust comparisons between simulators. The framework supports practical summaries such as AUC_cal and CVaR_cal and extends to pairwise simulator comparison. It is demonstrated on WorldValueBench, where four LLMs are profiled for fidelity against human survey data. The results show that fidelity varies across models and give insight into the tightness of the calibration. The discussion outlines extensions that tighten the bounds and handle dynamic or shifted data settings.
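To make the idea of a per-scenario confidence set and a pseudo-discrepancy concrete, the following is a minimal sketch for a single Bernoulli scenario. It is an illustration under stated assumptions, not the paper's definition: the confidence set is taken to be a two-sided Hoeffding interval, the discrepancy is taken to be the absolute gap between two probabilities, and the function names `hoeffding_interval` and `worst_case_abs_gap` are hypothetical.

```python
import math


def hoeffding_interval(p_hat, n, delta):
    """Two-sided Hoeffding interval for a Bernoulli mean.

    Covers the true mean with probability at least 1 - delta when p_hat
    is the average of n independent samples. Assumed confidence set for
    this sketch; the paper may use a different construction.
    """
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)


def worst_case_abs_gap(interval, q_hat):
    """Largest |p - q_hat| over all p in the interval.

    The absolute gap is an assumed discrepancy measure. Because it is
    convex in p, the maximum is attained at one of the two endpoints.
    """
    lo, hi = interval
    return max(abs(lo - q_hat), abs(hi - q_hat))


# Hypothetical scenario: 200 human responses with empirical rate 0.5,
# and a simulator whose estimated rate is 0.7.
interval = hoeffding_interval(p_hat=0.5, n=200, delta=0.05)
gap = worst_case_abs_gap(interval, q_hat=0.7)
```

Taking the worst case over the confidence set yields an observable quantity that is at least as large as the plug-in gap `|p_hat - q_hat|`, which is the conservative direction one would want from an upper bound on an unobservable discrepancy.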

Abstract

Simulation of complex systems originated in manufacturing and queuing applications. It is now widely used for large-scale, ML-based systems in research, education, and consumer surveys. However, characterizing the discrepancy between simulators and ground truth remains challenging for increasingly complex, machine-learning-based systems. We propose a computationally tractable method to estimate the quantile function of the discrepancy between the simulated and ground-truth outcome distributions. Our approach focuses on output uncertainty and treats the simulator as a black box, imposing no modeling assumptions on its internals, and hence applies broadly across many parameter families, from Bernoulli and multinomial models to continuous, vector-valued settings. The resulting quantile curve supports confidence interval construction for unseen scenarios, risk-aware summaries of sim-to-real discrepancy (e.g., VaR/CVaR), and comparison of simulators' performance. We demonstrate our methodology in an application assessing LLM simulation fidelity on the WorldValueBench dataset spanning four LLMs.
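As a rough illustration of the risk-aware summaries mentioned above, the sketch below computes an empirical quantile (VaR) and an upper-tail CVaR from a list of per-scenario discrepancy values. It is a plain empirical summary under assumed conventions, not the paper's calibrated curve or its CVaR_cal; the function names and the sample values are hypothetical.

```python
import math


def empirical_quantile(values, alpha):
    """Smallest sample value v such that at least an alpha fraction
    of the samples are <= v (a left-continuous empirical quantile)."""
    x = sorted(float(v) for v in values)
    # 1-indexed order statistic, clamped to the valid range
    k = min(max(math.ceil(alpha * len(x)), 1), len(x))
    return x[k - 1]


def empirical_cvar(values, alpha):
    """Mean of the sample values at or above the empirical alpha-quantile.

    One of several common finite-sample CVaR conventions; assumed here
    for illustration only.
    """
    var = empirical_quantile(values, alpha)
    tail = [float(v) for v in values if float(v) >= var]
    return sum(tail) / len(tail)


# Hypothetical per-scenario discrepancy values
discrepancies = [0.4, 0.1, 0.3, 0.2]
var_50 = empirical_quantile(discrepancies, 0.5)
cvar_50 = empirical_cvar(discrepancies, 0.5)
```

Sweeping `alpha` over a grid in (0, 1] and plotting `empirical_quantile` against it gives an uncalibrated analogue of a quantile curve; two simulators could then be compared by overlaying their curves.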

Paper Structure

This paper contains 30 sections, 11 theorems, 110 equations, 11 figures, 1 table.

Key Result

Theorem 3.1

Suppose Assumptions ass:indep and ass:discr hold. For any simulation sample size $k \in \mathbb{N}$, define the per-scenario discrepancy (unobservable) $\Delta_j^{(k)}$ and the pseudo-discrepancy (observable) $\hat{\Delta}_j^{(k)}$ by where $\mathcal{C}_j\!\left(\hat{p}_j\right) \subset\Theta$ are data-driven compact confidence sets satisfying $\mathbb{P} (p_j\in \mathcal{C}_j\!\left(\hat{p}_j\rig

Figures (11)

  • Figure 1: Simulation Uncertainty Quantification.
  • Figure 2: Example of World Value Questions. Retrieved from WVS_Wave7_2020.
  • Figure 3: Calibrated $V(\alpha)$ across LLMs.
  • Figure 4: Robustness check of simulator performance under different $n$-levels.
  • Figure 5: Tightness analysis of different $n_j$ under GPT-4o.
  • ...and 6 more figures

Theorems & Definitions (18)

  • Example 3.1: Multinomial Confidence Set
  • Example 3.2: Bounded Outcomes
  • Example 3.3: Bernoulli Confidence Set
  • Example 3.4: Nonparametric $W_1$ Confidence Set
  • Theorem 3.1
  • Remark 1
  • Theorem 3.2
  • Theorem 5.1
  • Remark 2
  • Lemma A.1
  • ...and 8 more