CLIPping the Limits: Finding the Sweet Spot for Relevant Images in Automated Driving Systems Perception Testing
Philipp Rigoll, Laurenz Adolph, Lennart Ries, Eric Sax
TL;DR
This work addresses the challenge of assembling semantically relevant image subsets for testing the perception of automated driving systems. It uses CLIP to rank images by their similarity to a natural-language query and introduces an automatic threshold that turns the ranking into a usable partial dataset. The cosine-distance distribution is modeled as a mixture of two Gaussians, and the threshold is derived from the intersection of the two components; when the mixture fit fails, a single-Gaussian fallback keeps the procedure fully automated. The method balances false positives and false negatives and is demonstrated on the ACDC dataset with prompts such as snow, fog, rain, and night, plus a fallback analysis on traffic lights. The approach reduces manual curation and enables scalable, automated testing of perception robustness, while acknowledging prompt-dependent variability and outlining directions for prompt optimization and broader model evaluation.
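The intersection-based threshold can be illustrated with a minimal sketch. It assumes the two-Gaussian mixture has already been fitted to the cosine distances, so that weights, means, and standard deviations are available. It also assumes that "intersection" means the point where the weighted component densities are equal; the summary does not specify this. The function names `gaussian_pdf` and `intersection_threshold` are hypothetical, and the fallback itself is not implemented: the sketch only returns `None` to signal that a fallback would be needed.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def intersection_threshold(w1, m1, s1, w2, m2, s2):
    """Return the point between the two means where the weighted densities
    w1*N(m1, s1) and w2*N(m2, s2) are equal, or None if there is none.

    Assumption: weighted densities are compared; the source only states
    that the threshold comes from the intersection of the two Gaussians.
    """
    # Equating the log densities gives a*x^2 + b*x + c = 0.
    a = 1.0 / (2.0 * s2 ** 2) - 1.0 / (2.0 * s1 ** 2)
    b = m1 / s1 ** 2 - m2 / s2 ** 2
    c = (m2 ** 2 / (2.0 * s2 ** 2) - m1 ** 2 / (2.0 * s1 ** 2)
         + math.log((w1 * s2) / (w2 * s1)))

    if abs(a) < 1e-12:
        # Equal standard deviations: the equation is linear.
        if abs(b) < 1e-12:
            return None  # identical means as well, no usable crossing
        roots = [-c / b]
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return None  # the weighted densities never cross
        sq = math.sqrt(disc)
        roots = [(-b - sq) / (2.0 * a), (-b + sq) / (2.0 * a)]

    # Keep only a crossing that separates the two components.
    lo, hi = min(m1, m2), max(m1, m2)
    between = sorted(r for r in roots if lo < r < hi)
    return between[0] if between else None
```

Under these assumptions, images whose cosine distance to the prompt falls below the returned value would be treated as relevant. A return value of `None` marks the situation in which a caller would switch to a fallback such as the single-Gaussian procedure mentioned above.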
Abstract
Perception systems, especially cameras, are the eyes of automated driving systems. Ensuring that they function reliably and robustly is therefore an important building block in the automation of vehicles. There are various approaches to testing the perception of automated driving systems, but ultimately they all come down to investigating the behavior of perception systems under specific input data. Camera images are a crucial part of this input data, so image data sets are collected for testing automated driving systems. Finding specific images in these data sets, however, is non-trivial. Thanks to recent developments in neural networks, there are now methods for sorting the images in a data set by their similarity to a natural-language prompt. To further automate the provision of search results, we automate the definition of a threshold in these sorted results and return only the images relevant to the prompt. Our focus is on preventing false positives and false negatives equally. Our method must also be robust, so we provide a fallback solution for the case in which our assumptions are not fulfilled.
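The step from a sorted result list to a returned subset can be sketched as follows. This is only an illustration under stated assumptions: the embeddings are taken to be precomputed by a CLIP-like model (no model is called here), the threshold is passed in rather than derived, and the names `cosine_distance` and `select_relevant` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def select_relevant(image_embeddings, prompt_embedding, threshold):
    """Rank images by distance to the prompt and keep those within the threshold.

    image_embeddings: dict mapping an image name to its embedding vector
    (assumed to be precomputed by a CLIP-like model).
    """
    ranked = sorted(
        (cosine_distance(vec, prompt_embedding), name)
        for name, vec in image_embeddings.items()
    )
    return [name for dist, name in ranked if dist <= threshold]
```

Without the cut at `threshold`, the function would return the full ranking, leaving it to the user to decide where the relevant results end.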
