Trustworthy image-to-image translation: evaluating uncertainty calibration in unpaired training scenarios
Ciaran Bench, Emir Ahmed, Spencer A. Thomas
TL;DR
This work addresses trustworthy image-to-image translation in unpaired settings for medical imaging. It compares the GAN-based cycleGAN with the diffusion-based SynDiff and applies uncertainty quantification via Monte Carlo Dropout and deep ensembles. It introduces a data-efficient calibration scheme that uses augmented test sets to relate accuracy metrics such as FID to predictive uncertainty (mPSD), making it possible to assess whether the estimated uncertainty meaningfully reflects model doubt. Across mammography and non-medical tasks, the study shows that (i) both models can reduce style-transfer FID while preserving tissue structure (CWSSIM) and (ii) uncertainty estimates correlate with image quality and can highlight hallucinations, supporting cautious deployment in data-scarce clinical domains. The approach offers a practical path toward trustworthy unpaired translation when ground truths are unavailable. However, calibration is demonstrated on augmented subsets, and broader validation and hyperparameter exploration are needed for robust generalization.
Abstract
Mammographic screening is an effective method for detecting breast cancer, facilitating early diagnosis. However, the current need to manually inspect images places a heavy burden on healthcare systems, spurring a desire for automated diagnostic protocols. Techniques based on deep neural networks have been shown to be effective in some studies, but their tendency to overfit leaves a considerable risk of poor generalisation and misdiagnosis, preventing their widespread adoption in clinical settings. Data augmentation schemes based on unpaired neural style transfer models have been proposed that improve generalisability by diversifying the representations of training image features in the absence of paired training data (images of the same tissue in both image styles). However, these models are similarly prone to various pathologies, and evaluating their performance is challenging without ground truths or large datasets (as is often the case in medical imaging). Here, we consider two frameworks: the GAN-based cycleGAN and the more recently developed diffusion-based SynDiff. We evaluate their performance when trained on image patches parsed from three open-access mammography datasets and one non-medical image dataset. We consider the use of uncertainty quantification to assess model trustworthiness, and propose a scheme to evaluate calibration quality in unpaired training scenarios. This ultimately helps facilitate the trustworthy use of image-to-image translation models in domains where ground truths are not typically available.
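
As a rough illustration of the kind of uncertainty quantification referred to above, the sketch below reduces several stochastic predictions for the same input (for example, repeated forward passes with dropout active, or the outputs of ensemble members) to one scalar: the per-pixel standard deviation across predictions, averaged over pixels. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name `mean_pixel_std`, the flat-list image representation, and the toy values are all hypothetical, and whether this statistic matches the exact uncertainty measure used in the work is an assumption.

```python
from statistics import fmean, pstdev


def mean_pixel_std(samples):
    """Average, over pixels, of the standard deviation across predictions.

    `samples` is a list of N predictions for the same input, each given as a
    flat list of pixel values of equal length (a hypothetical representation
    chosen to keep the sketch dependency-free).
    """
    if len(samples) < 2:
        raise ValueError("need at least two stochastic predictions")
    n_pixels = len(samples[0])
    if any(len(s) != n_pixels for s in samples):
        raise ValueError("all predictions must have the same number of pixels")
    # zip(*samples) groups the N predicted values for each pixel position.
    per_pixel_std = [pstdev(values) for values in zip(*samples)]
    return fmean(per_pixel_std)


# Toy data (made up for illustration): three predictions of a 3-pixel image.
agree = [[0.2, 0.8, 0.5], [0.2, 0.8, 0.5], [0.2, 0.8, 0.5]]
disagree = [[0.0, 0.8, 0.5], [0.4, 0.8, 0.5], [0.2, 0.8, 0.5]]

low_uncertainty = mean_pixel_std(agree)      # predictions identical
high_uncertainty = mean_pixel_std(disagree)  # predictions differ at one pixel
```

A score of zero means the predictions agree exactly, and larger values indicate more disagreement. Judging calibration would additionally require comparing such scores against an accuracy metric across test subsets, which this sketch does not attempt.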
