Quantifying and Bridging the Fidelity Gap: A Decisive-Feature Approach to Comparing Synthetic and Real Imagery

Danial Safaei, Siddartha Khastgir, Mohsen Alirezaei, Jeroen Ploeg, Son Tong, Xingyu Zhao

TL;DR

This work addresses the sim-to-real fidelity gap in avatar-based autonomous-vehicle testing by introducing Decisive Feature Fidelity (DFF), a SUT-specific measure of mechanism parity that uses explainable AI to compare the decisive features driving a SUT's decisions in the real and synthetic domains. It presents a practical DFF estimator based on counterfactual explanations and a DFF-guided calibration objective that tunes the synthetic data generator to minimize mechanism gaps while preserving output performance. Experiments on 2126 KITTI–VirtualKITTI2 pairs across three SUT heads show that DFF reveals discrepancies not captured by traditional input- or output-focused metrics, and that DFF-guided calibration improves decisive-feature alignment and input fidelity without degrading task outputs. The authors advocate using DFF alongside conventional fidelity checks to enable more trustworthy virtual testing and more effective simulator calibration.

Abstract

Virtual testing using synthetic data has become a cornerstone of autonomous vehicle (AV) safety assurance. Despite progress in improving visual realism through advanced simulators and generative AI, recent studies reveal that pixel-level fidelity alone does not ensure reliable transfer from simulation to the real world. What truly matters is whether the system-under-test (SUT) bases its decisions on the same causal evidence in both real and simulated environments, not just whether images "look real" to humans. This paper addresses the lack of such a behavior-grounded fidelity measure by introducing Decisive Feature Fidelity (DFF), a new SUT-specific metric that extends the existing fidelity spectrum to capture mechanism parity, that is, the agreement in causal evidence underlying the SUT's decisions across domains. DFF leverages explainable-AI (XAI) methods to identify and compare the decisive features driving the SUT's outputs for matched real-synthetic pairs. We further propose practical estimators based on counterfactual explanations, along with a DFF-guided calibration scheme to enhance simulator fidelity. Experiments on 2126 matched KITTI-VirtualKITTI2 pairs demonstrate that DFF reveals discrepancies overlooked by conventional output-value fidelity. Furthermore, results show that DFF-guided calibration improves decisive-feature and input-level fidelity without sacrificing output-value fidelity across diverse SUTs.

Paper Structure

This paper contains 20 sections, 7 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Top: three levels of the fidelity spectrum (IV, LF, OV) from chengInstanceLevelSafetyAwareFidelity2024. Bottom: our extension that adds decisive-feature fidelity (DFF) and a calibrator to adjust the generator. Safety-aware fidelity (SA) is defined in chengInstanceLevelSafetyAwareFidelity2024 but is not explicitly visualized here.
  • Figure 2: Decisive-Feature Fidelity across SUTs on the same dataset. (a) Distribution of DFF distances $D$ for Steering, YOLOP--DA, and YOLOP--LL; vertical dashed lines mark each SUT's $\varepsilon_{95}$ threshold. (b) Empirical CDFs $P(D\le\varepsilon)$; the horizontal dashed line indicates 95% coverage, and markers show where each CDF crosses this level. The three curves differ substantially, confirming that DFF is SUT-specific: heads processing the same frames exhibit different decisive-feature mismatch profiles. While OV losses are minimal (see Table \ref{tab:rq1-pilot}), DFF retains substantial spread, indicating mechanism differences not captured by IV/OV.
  • Figure 3: Qualitative comparison of calibration methods on a representative KITTI--Virtual KITTI 2 pair. (a) Real KITTI input. (b) Uncalibrated baseline synthetic. (c)--(e) OVF-calibrated outputs for Steering, Drivable Area, and Lane Lines respectively. (f)--(h) DFF-calibrated outputs for the same three SUTs. Each synthetic panel displays its IV score (labeled "IV", computed as $1-\text{LPIPS}$), OV score, and DFF distance (labeled "D"). Cyan boxes (Region A) highlight road-surface texture; magenta boxes (Region B) highlight sky and horizon. DFF calibration produces more realistic road texture and atmospheric rendering in the decisive regions, whereas OVF calibration optimises output similarity without necessarily improving perceptual realism in SUT-critical areas.
  • Figure 4: Zoomed comparison of decisive regions from Fig. \ref{fig:qualitative}. Region A (Road): Baseline shows unrealistic flat-green rendering; DFF-calibrated variants recover asphalt texture and lane markings. Region B (Sky): Baseline sky lacks atmospheric variation; DFF calibration restores realistic sky gradients and foliage detail.
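
The Figure 2 caption refers to an empirical CDF $P(D\le\varepsilon)$ and a per-SUT threshold $\varepsilon_{95}$ at 95% coverage. The sketch below shows one way such quantities could be computed from a list of per-pair DFF distances. It assumes that $\varepsilon_{95}$ is the smallest observed distance whose empirical coverage reaches 95%; the paper's exact definition may differ. The function names and the randomly generated distances are hypothetical and are not taken from the paper.

```python
import numpy as np


def empirical_cdf(distances, eps):
    """Fraction of matched pairs whose DFF distance D is at most eps."""
    d = np.asarray(distances, dtype=float)
    if d.size == 0:
        raise ValueError("distances must be non-empty")
    return float(np.mean(d <= eps))


def coverage_threshold(distances, coverage=0.95):
    """Smallest observed distance eps with empirical P(D <= eps) >= coverage.

    Assumption: this is how a threshold such as eps_95 could be defined;
    it is an illustrative choice, not the paper's confirmed procedure.
    """
    d = np.sort(np.asarray(distances, dtype=float))
    if d.size == 0:
        raise ValueError("distances must be non-empty")
    # Number of samples that must lie at or below the threshold.
    # The small tolerance guards against floating-point round-up.
    needed = int(np.ceil(coverage * d.size - 1e-9))
    k = min(max(needed, 1), d.size) - 1
    return float(d[k])


if __name__ == "__main__":
    # Hypothetical distances for three SUT heads (synthetic, illustration only).
    rng = np.random.default_rng(0)
    heads = {
        "Steering": rng.gamma(shape=2.0, scale=0.05, size=2126),
        "YOLOP-DA": rng.gamma(shape=2.0, scale=0.10, size=2126),
        "YOLOP-LL": rng.gamma(shape=2.0, scale=0.20, size=2126),
    }
    for name, dist in heads.items():
        eps95 = coverage_threshold(dist, 0.95)
        print(name, "eps_95 =", round(eps95, 4),
              "coverage =", round(empirical_cdf(dist, eps95), 4))
```

Computing the threshold separately for each head mirrors the caption's point that DFF is SUT-specific: the same frames can yield different distance distributions, and therefore different thresholds, for different heads.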