Scalable Evaluation of the Realism of Synthetic Environmental Augmentations in Images

Damian J. Ruck, Paul Vautravers, Oliver Chalkley, Jake Thomas

TL;DR

This work presents a scalable framework for assessing the realism of synthetic image-editing methods and suggests that modern generative image-editing models can enable scalable generation of realistic adverse-condition imagery for evaluation pipelines.

Abstract

Evaluation of AI systems often requires synthetic test cases, particularly for rare or safety-critical conditions that are difficult to observe in operational data. Generative AI offers a promising approach for producing such data through controllable image editing, but its usefulness depends on whether the resulting images are sufficiently realistic to support meaningful evaluation. We present a scalable framework for assessing the realism of synthetic image-editing methods and apply it to the task of adding environmental conditions (fog, rain, snow, and nighttime) to car-mounted camera images. Using 40 clear-day images, we compare rule-based augmentation libraries with generative AI image-editing models. Realism is evaluated using two complementary automated metrics: a vision-language model (VLM) jury for perceptual realism assessment, and embedding-based distributional analysis to measure similarity to genuine adverse-condition imagery. Generative AI methods substantially outperform rule-based approaches, with the best generative method achieving approximately 3.6 times the acceptance rate of the best rule-based method. Performance varies across conditions: fog proves easiest to simulate, while nighttime transformations remain challenging. Notably, the VLM jury assigns imperfect acceptance even to real adverse-condition imagery, establishing practical ceilings against which synthetic methods can be judged. By this standard, leading generative methods match or exceed real-image performance for most conditions. These results suggest that modern generative image-editing models can enable scalable generation of realistic adverse-condition imagery for evaluation pipelines. Our framework therefore provides a practical approach for scalable realism evaluation, though validation against human studies remains an important direction for future work.
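
The abstract does not state how the jury's votes are aggregated. As a minimal sketch, assume each VLM juror returns a binary accept/reject vote per image, that the acceptance rate is the mean accepted fraction across images, and that uncertainty is estimated with a percentile bootstrap over images. The function name, array layout, and resample count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def acceptance_rate_ci(votes, n_boot=2000, alpha=0.05, seed=0):
    """Acceptance rate with a percentile-bootstrap confidence interval.

    votes: array of shape (n_images, n_jurors) holding 0/1 accept decisions
           (hypothetical layout, assumed for this sketch).
    Returns (rate, lower, upper).
    """
    votes = np.asarray(votes, dtype=float)
    # Fraction of jurors that accepted each image.
    per_image = votes.mean(axis=1)
    n = per_image.shape[0]

    # Resample images with replacement and recompute the mean each time.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = per_image[idx].mean(axis=1)

    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(per_image.mean()), float(lower), float(upper)


# Toy example: 4 images judged by 3 jurors (made-up votes).
example_votes = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 0]]
rate, lower, upper = acceptance_rate_ci(example_votes)
```

Applying the same function to votes on real adverse-condition images would give the kind of ceiling the abstract describes, against which each synthetic method's rate can be compared.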

Paper Structure (74 sections, 2 equations, 4 figures, 19 tables)

Figures (4)

  • Figure 1: Overview of the experimental framework. Clear-day images are transformed using rule-based or generative AI methods to simulate adverse environmental conditions. Realism is assessed via VLM jury evaluation and embedding-based distributional analysis using CLIP and DINOv3.
  • Figure 2: Example augmentations across methods and target conditions. The top row shows real ACDC images. Subsequent rows show outputs from rule-based methods (imgaug, albumentations) and generative AI methods (OpenAI, Gemini, Qwen, Flux).
  • Figure 3: Realism scores by augmentation method and target environmental condition. (A) VLM jury acceptance rates. (B) Negative relative Mahalanobis distance in CLIP embedding space. Grey diamonds indicate baselines from real adverse-condition images. Error bars show 95% bootstrap confidence intervals. Higher values indicate greater realism.
  • Figure 4: Cases of maximal disagreement between VLM jury and Mahalanobis distance. Left (green): images unanimously accepted by VLMs despite high Mahalanobis distance. Right (red): images unanimously rejected by VLMs despite low Mahalanobis distance. Visual inspection suggests VLM judgments better capture perceptual realism.
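
The captions for Figures 3 and 4 refer to a negative relative Mahalanobis distance in embedding space without defining it here. The sketch below assumes one common definition: the squared Mahalanobis distance to a Gaussian fitted on real target-condition embeddings, minus the squared distance to a Gaussian fitted on a broader background set, with the sign flipped so that higher means closer to the target condition. The ridge term, function names, and the random stand-in embeddings are all assumptions for illustration. In practice the inputs would be CLIP or DINOv3 feature vectors.

```python
import numpy as np


def fit_gaussian(x, ridge=1e-3):
    """Fit mean and precision matrix; the ridge keeps the covariance invertible."""
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False) + ridge * np.eye(x.shape[1])
    return mu, np.linalg.inv(cov)


def mahalanobis_sq(z, mu, precision):
    """Squared Mahalanobis distance of each row of z to (mu, precision)."""
    d = z - mu
    return np.einsum("ij,jk,ik->i", d, precision, d)


def neg_relative_mahalanobis(z, target, background):
    """Higher score = closer to the target-condition distribution
    relative to the background distribution (assumed definition)."""
    mu_t, prec_t = fit_gaussian(target)
    mu_b, prec_b = fit_gaussian(background)
    return -(mahalanobis_sq(z, mu_t, prec_t) - mahalanobis_sq(z, mu_b, prec_b))


# Stand-in data: random vectors in place of real image embeddings.
rng = np.random.default_rng(0)
target = rng.normal(3.0, 1.0, size=(500, 4))        # e.g. real fog embeddings
background = rng.normal(0.0, 3.0, size=(2000, 4))   # e.g. all conditions pooled
candidates = np.array([[3.0, 3.0, 3.0, 3.0],        # near the target cluster
                       [-3.0, -3.0, -3.0, -3.0]])   # far from it
scores = neg_relative_mahalanobis(candidates, target, background)
```

With high-dimensional embeddings and few reference images the sample covariance is rank-deficient, so some form of regularization or dimensionality reduction is needed. The ridge used here is only one option.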