Search-based DNN Testing and Retraining with GAN-enhanced Simulations
Mohammed Oualid Attaoui, Fabrizio Pastore, Lionel Briand
TL;DR
This work addresses the fidelity gap between simulators and real-world data in safety-critical DNN testing by introducing DESIGNATE, a framework that couples meta-heuristic search with GAN-enhanced simulations to generate realistic, diverse test inputs. It uses two fitness objectives, accuracy (grounded in IoU metrics) and diversity (via feature-based image representations from a ResNet50), and employs NSGA-II with an archive to explore diverse failure-inducing scenarios. Empirical results in urban road segmentation and Martian terrain detection show that DESIGNATE outperforms baselines and state-of-the-art testing methods both in discovering failure cases and in enabling more effective retraining, with GAN-generated inputs yielding the most substantial gains. The findings support integrating GANs with simulators for robust DNN testing and retraining in vision-based safety-critical systems, and suggest broader applications in automotive, robotics, and space domains where high-fidelity realism is crucial.
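The two fitness objectives can be illustrated with a minimal sketch. It is not the DESIGNATE implementation: the function names `iou` and `diversity` are hypothetical, the use of Euclidean distance to the nearest archived feature vector is an assumption, and the ResNet50 feature extraction step is omitted (feature vectors are taken as given).

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Intersection over Union of two boolean masks with the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return float(np.logical_and(pred, true).sum() / union)

def diversity(features, archive_features):
    """Euclidean distance from a feature vector to its nearest archive entry.

    Assumption: diversity is measured against an archive of previously kept
    feature vectors; the source only states that it is feature-based.
    """
    archive = np.asarray(archive_features, dtype=float)
    if archive.size == 0:
        # An empty archive makes any candidate maximally novel.
        return float("inf")
    dists = np.linalg.norm(archive - np.asarray(features, dtype=float), axis=1)
    return float(dists.min())
```

Under this sketch, a search would favor inputs with low `iou` (poor DNN accuracy) and high `diversity` (far from inputs already archived).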
Abstract
In safety-critical systems (e.g., autonomous vehicles and robots), Deep Neural Networks (DNNs) are becoming a key component for computer vision tasks, particularly semantic segmentation. Because DNN behavior cannot be assessed through code inspection and analysis, test automation has become an essential activity to gain confidence in the reliability of DNNs. Unfortunately, state-of-the-art automated testing solutions largely rely on simulators, whose fidelity is always imperfect, thus affecting the validity of test results. To address these limitations, we propose combining meta-heuristic search, used to explore the input space using simulators, with Generative Adversarial Networks (GANs), which transform the data generated by simulators into realistic input images. Such images can be used both to assess the DNN performance and to retrain the DNN more effectively. We applied our approach to a state-of-the-art DNN performing semantic segmentation and demonstrated that it outperforms a state-of-the-art GAN-based testing solution and several baselines. Specifically, it yields the largest number of diverse images that lead to the worst DNN performance. Further, the images generated with our approach lead to the highest improvement in DNN performance when used for retraining. In conclusion, we suggest always integrating GAN components when performing search-driven, simulator-based testing.
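The evaluation of a single candidate in such a pipeline might be sketched as follows. All names here (`evaluate_candidate`, `simulate`, `translate`, `predict`, `score`) are hypothetical placeholders rather than parts of the authors' tooling, and the sketch assumes that the simulator returns ground-truth labels together with the image and that the GAN translation preserves the scene layout, so those labels remain valid for the translated image.

```python
def evaluate_candidate(params, simulate, translate, predict, score):
    """Score the DNN under test on one search candidate.

    params    -- simulator configuration chosen by the search (hypothetical)
    simulate  -- callable returning (simulated_image, ground_truth)
    translate -- callable standing in for the GAN (simulated -> realistic)
    predict   -- callable standing in for the DNN under test
    score     -- callable comparing a prediction with the ground truth
    """
    sim_image, ground_truth = simulate(params)
    # Assumption: the translation keeps the layout, so ground_truth still applies.
    realistic_image = translate(sim_image)
    prediction = predict(realistic_image)
    return score(prediction, ground_truth)
```

A search algorithm would call a function of this kind for every candidate it generates and use the returned score as one of its fitness values.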
