Model-Free Assessment of Simulator Fidelity via Quantile Curves
Garud Iyengar, Yu-Shiou Willy Lin, Kaizheng Wang
TL;DR
This work addresses the problem of quantifying the sim-to-real discrepancy of complex, often ML-based simulators. It proposes a model-free method that estimates, from finite data, the full quantile function of the discrepancy between simulated and ground-truth outcomes. The authors develop a two-step procedure: first construct per-scenario confidence sets and a corresponding pseudo-discrepancy, then form a calibrated quantile curve $\hat{V}_m$ with finite-sample guarantees. The curve enables distribution-level inference and robust comparisons between simulators. The framework supports practical summaries such as AUC_cal and CVaR_cal and extends to pairwise simulator comparison. It is demonstrated on WorldValueBench, where four LLMs are profiled for fidelity against human survey data. The results show that fidelity varies across models and indicate how tight the calibration is; the discussion outlines extensions that tighten the bounds and handle dynamic or shifted data settings.
Abstract
Simulation of complex systems originated in manufacturing and queuing applications. It is now widely used for large-scale, ML-based systems in research, education, and consumer surveys. However, characterizing the discrepancy between simulators and ground truth remains challenging for increasingly complex, machine-learning-based systems. We propose a computationally tractable method to estimate the quantile function of the discrepancy between the simulated and ground-truth outcome distributions. Our approach focuses on output uncertainty and treats the simulator as a black box, imposing no modeling assumptions on its internals, and hence applies broadly across many parameter families, from Bernoulli and multinomial models to continuous, vector-valued settings. The resulting quantile curve supports confidence interval construction for unseen scenarios, risk-aware summaries of sim-to-real discrepancy (e.g., VaR/CVaR), and comparison of simulators' performance. We demonstrate our methodology in an application assessing LLM simulation fidelity on the WorldValueBench dataset spanning four LLMs.
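To make the idea of a discrepancy quantile curve concrete, the following is a minimal illustrative sketch, not the authors' construction. It assumes Bernoulli outcomes per scenario, a Hoeffding interval as the per-scenario confidence set, absolute difference as the discrepancy, and a plain empirical quantile with no calibration correction, so it carries none of the finite-sample guarantees described above. All function names (`hoeffding_radius`, `pseudo_discrepancy`, `quantile_curve`, `cvar`) and the synthetic data are hypothetical.

```python
import numpy as np

def hoeffding_radius(n, delta):
    """Half-width of a (1 - delta) Hoeffding interval for a mean of n values in [0, 1]."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n))

def pseudo_discrepancy(p_hat, n, q_sim, delta=0.05):
    """Worst-case |p - q_sim| over the confidence interval around p_hat (assumed form)."""
    r = hoeffding_radius(n, delta)
    lo = np.clip(p_hat - r, 0.0, 1.0)
    hi = np.clip(p_hat + r, 0.0, 1.0)
    # |p - q_sim| is convex in p, so its maximum over [lo, hi] is at an endpoint.
    return np.maximum(np.abs(lo - q_sim), np.abs(hi - q_sim))

def quantile_curve(values, alphas):
    """Empirical quantiles: the ceil(alpha * m)-th smallest value for each alpha."""
    v = np.sort(np.asarray(values, dtype=float))
    m = v.size
    # Small tolerance guards against floating-point overshoot in alpha * m.
    ranks = np.ceil(np.asarray(alphas, dtype=float) * m - 1e-9).astype(int)
    return v[np.clip(ranks - 1, 0, m - 1)]

def cvar(values, alpha):
    """Mean of the values at or above the empirical alpha-quantile."""
    v = np.asarray(values, dtype=float)
    q = quantile_curve(v, [alpha])[0]
    return v[v >= q].mean()

# Synthetic example: m scenarios, each with a simulator probability q_sim
# and n real (here: simulated) Bernoulli responses with true mean p_true.
rng = np.random.default_rng(0)
m = 50
p_true = rng.uniform(0.2, 0.8, size=m)
q_sim = np.clip(p_true + rng.normal(0.0, 0.05, size=m), 0.0, 1.0)
n = rng.integers(100, 500, size=m)
p_hat = rng.binomial(n, p_true) / n

d = pseudo_discrepancy(p_hat, n, q_sim)
alphas = np.linspace(0.1, 1.0, 10)
for a, v in zip(alphas, quantile_curve(d, alphas)):
    print(f"alpha={a:.1f}  quantile={v:.3f}")
print(f"CVaR at 0.9: {cvar(d, 0.9):.3f}")
```

Reading the curve at a level such as 0.9 gives a discrepancy value that about 90% of the observed scenarios do not exceed, and the CVaR-style summary averages the worst tail. Because the pseudo-discrepancy takes the worst case over each interval, it is never smaller than the plug-in value |p_hat - q_sim|, which is one source of conservativeness in this toy version.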
