Style transfer as data augmentation: evaluating unpaired image-to-image translation models in mammography

Emir Ahmed, Spencer A. Thomas, Ciaran Bench

TL;DR

The paper tackles the problem of generalisability in mammography by evaluating unpaired image-to-image translation models as style-transfer data augmentation. It compares CycleGAN and SynDiff across three datasets using metrics that quantify both style similarity (FID, KID) and content preservation (MSE, PSNR, SSIM, DISTS, FSIM, CW-SSIM), highlighting that no single metric fully captures model performance. Findings show CycleGAN typically preserves content well, while SynDiff can introduce small offsets that affect common content metrics but can be mitigated with post-processing and tolerant measures; overall, multiple metrics are necessary for robust evaluation. The work provides practical guidance for applying and assessing style-transfer techniques in mammography and offers a framework for interpreting metric signals in the presence of artefacts and preprocessing choices.
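To illustrate why small offsets can distort pixel-wise content metrics, here is a minimal NumPy sketch of MSE and PSNR together with a shift-tolerant variant. It assumes the offsets are small integer spatial shifts and that intensities lie in [0, 1]. The helper `best_shift_mse` is a hypothetical stand-in for the registration and tolerant measures mentioned above, not the procedure used in the paper.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two equally shaped images."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))


def psnr(a, b, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / err))


def best_shift_mse(ref, img, max_shift=3):
    """Lowest MSE over integer shifts of `img` within +/- max_shift pixels.

    Hypothetical helper (an assumption for illustration): a brute-force
    search standing in for a proper registration step. A border of
    max_shift pixels is cropped so np.roll wrap-around is never compared.
    """
    if max_shift < 1:
        raise ValueError("max_shift must be at least 1")
    ref = np.asarray(ref, dtype=np.float64)
    img = np.asarray(img, dtype=np.float64)
    m = max_shift
    best_err, best_shift = float("inf"), (0, 0)
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            shifted = np.roll(img, shift=(dy, dx), axis=(0, 1))
            err = mse(ref[m:-m, m:-m], shifted[m:-m, m:-m])
            if err < best_err:
                best_err, best_shift = err, (dy, dx)
    return best_err, best_shift


# Synthetic example: the "translated" image is the content image moved by
# two rows and one column, with no other change.
rng = np.random.default_rng(0)
content = rng.random((64, 64))
translated = np.roll(content, shift=(2, -1), axis=(0, 1))

print("PSNR without alignment:", psnr(content, translated))
print("Best (MSE, shift) after search:", best_shift_mse(content, translated))
```

In this synthetic case the content is fully preserved, yet the unaligned score treats the shifted image as a poor match. This is the kind of signal that needs careful interpretation when a model introduces offsets.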

Abstract

Several studies indicate that deep learning models can learn to detect breast cancer from mammograms (X-ray images of the breasts). However, challenges with overfitting and poor generalisability prevent their routine use in the clinic. Models trained on data from one patient population may not perform well on another because of differences in their data domains, which arise from variations in scanning technology or patient characteristics. Data augmentation techniques alter existing examples to expand the diversity of feature representations in the training data, which can improve generalisability. Image-to-image translation models are one approach capable of imposing the characteristic feature representations (i.e. style) of images from one dataset onto another. However, evaluating model performance is non-trivial, particularly in the absence of ground truths (a common reality in medical imaging). Here, we describe key aspects that should be considered when evaluating style transfer algorithms, highlighting the advantages and disadvantages of popular metrics and the important factors to be mindful of when implementing them in practice. We consider two types of generative models: a cycle-consistent generative adversarial network (CycleGAN) and a diffusion-based SynDiff model. We learn unpaired image-to-image translation across three mammography datasets. We highlight that undesirable aspects of model performance may determine the suitability of some metrics, and provide analysis indicating the extent to which various metrics assess unique aspects of model performance. We emphasise the need to use several metrics for a comprehensive assessment of model performance.

Paper Structure

This paper contains 14 sections, 3 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Example pair of sections from adapted CBIS image patches (a) before and (b) after registration, for the CBIS $\rightarrow$ VDM task. After registration, the adapted CBIS output contains some blur, which could affect the SSIM and PSNR scores, (c) and (d), respectively.