Deterministic World Models for Verification of Closed-loop Vision-based Systems
Yuang Geng, Zhuoyang Zhou, Zhongzheng Zhang, Siyuan Pan, Hoang-Dung Tran, Ivan Ruchkin
TL;DR
This work tackles the challenge of verifying safety for vision-based closed-loop control by replacing stochastic camera surrogates with a deterministic decoder that maps physical states directly to images. A dual-loss trained Deterministic World Model (DWM) enables precise, state-driven image generation, which, when combined with Star-based reachability (StarV) and layer-wise propagation, yields tight, scalable verification for image-based controllers. The authors further bridge the surrogate to the real system using conformal prediction to bound trajectory deviations and inflate reachable sets for probabilistic guarantees, achieving rigorous containment of real trajectories with high confidence. Across CartPole, MountainCar, and Pendulum, the approach delivers superior precision, higher F1 scores, and reduced conservatism compared with latent-variable baselines, demonstrating practical impact for safety-critical vision-enabled systems.
Abstract
Verifying closed-loop vision-based control systems remains a fundamental challenge due to the high dimensionality of images and the difficulty of modeling visual environments. While generative models are increasingly used as camera surrogates in verification, their reliance on stochastic latent variables introduces unnecessary overapproximation error. To address this bottleneck, we propose a Deterministic World Model (DWM) that maps system states directly to generative images, effectively eliminating uninterpretable latent variables to ensure precise input bounds. The DWM is trained with a dual-objective loss function that combines pixel-level reconstruction accuracy with a control difference loss to maintain behavioral consistency with the real system. We integrate DWM into a verification pipeline utilizing Star-based reachability analysis (StarV) and employ conformal prediction to derive rigorous statistical bounds on the trajectory deviation between the world model and the actual vision-based system. Experiments on standard benchmarks show that our approach yields significantly tighter reachable sets and better verification performance than a latent-variable baseline.
