Bayesian generative models can flag performance loss, bias, and out-of-distribution image content

Miguel López-Pérez, Marco Miani, Valery Naranjo, Søren Hauberg, Aasa Feragen

TL;DR

Medical imaging often uses generative models, but distribution shifts and out-of-distribution data undermine reliability; current uncertainty quantification for VAEs is limited. The authors introduce SLUG, a Sketched Lanczos Uncertainty Global score that extends Laplace-based uncertainty estimation to VAEs using scalable stochastic trace estimators. SLUG correlates with reconstruction error and racial bias in dermatology datasets and can detect OOD content at the pixel level. This work enables safer deployment of generative models in clinical settings by flagging performance loss, bias, and OOD image content.

Abstract

Generative models are popular for medical imaging tasks such as anomaly detection, feature extraction, data visualization, or image generation. Since they are parameterized by deep learning models, they are often sensitive to distribution shifts and unreliable when applied to out-of-distribution data, creating a risk of, e.g., underrepresentation bias. This behavior can be flagged using uncertainty quantification (UQ) methods for generative models, but their availability remains limited. We propose SLUG, a new UQ method for VAEs that combines recent advances in Laplace approximations with stochastic trace estimators to scale gracefully with image dimensionality. We show that our UQ score -- unlike the VAE's encoder variances -- correlates strongly with reconstruction error and racial underrepresentation bias for dermatological images. We also show how pixel-wise uncertainty can detect out-of-distribution image content such as ink, rulers, and patches, which are known to induce learning shortcuts in predictive models.
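The abstract mentions stochastic trace estimators without specifying which one is used. As a rough illustration of the general idea only, the sketch below implements the standard Hutchinson estimator, which approximates the trace of a matrix using nothing but matrix-vector products. This is an assumption chosen for illustration, not the SLUG algorithm itself: the function name `hutchinson_trace`, the random test matrix, and the probe count are all hypothetical.

```python
import numpy as np


def hutchinson_trace(matvec, dim, num_probes=64, seed=0):
    """Estimate trace(A) from matrix-vector products v -> A @ v.

    Uses Rademacher probes (entries +1 or -1), for which
    E[v^T A v] = trace(A). Illustrative only; not the paper's code.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=dim)  # random sign vector
        total += v @ matvec(v)                 # one sample of v^T A v
    return total / num_probes


# Hypothetical test problem: a random symmetric positive semi-definite matrix.
rng = np.random.default_rng(1)
B = rng.standard_normal((200, 200))
A = B @ B.T / 200

estimate = hutchinson_trace(lambda v: A @ v, dim=200, num_probes=500)
exact = np.trace(A)
print(f"estimate={estimate:.2f}, exact={exact:.2f}")
```

The appeal for high-dimensional images is that the matrix never has to be formed explicitly; only a routine that applies it to a vector is required, so the cost grows with the number of probes rather than with the square of the dimension.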

Paper Structure

This paper contains 16 sections, 6 equations, 4 figures.

Figures (4)

  • Figure 1: Performance correlates with skin tone representation when training dermatological VAEs (left). However, the VAE's standard deviations do not detect this bias, illustrating why we need better UQ for generative models (right).
  • Figure 2: On Fitzpatrick17k, the performance on light and dark skin tones changes with their representation. The VAE encoder uncertainty is a poor indicator, while SLUG follows performance across groups and training scenarios.
  • Figure 3: On the external PASSION dataset, we see again how reduced MSE is flagged by increased SLUG uncertainty across dark skin tone groups.
  • Figure 4: Left: On Fitzpatrick17k (Dataset B -- Mixed), our SLUG score strongly correlates with the MSE. Center: On Fitzpatrick17k, removing samples with higher uncertainty results in consistent improvements. Right: On ISIC, the VAE reconstructs OOD data, but SLUG detects the OOD content in dermatological images.
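The center panel of Figure 4 describes removing the most uncertain samples and observing that performance improves. A minimal sketch of how such a rejection curve could be computed is given below, assuming one reconstruction error and one uncertainty score per image. The function `rejection_curve` and the toy numbers are hypothetical and are not taken from the paper's evaluation code or data.

```python
import numpy as np


def rejection_curve(errors, uncertainties, fractions):
    """Mean error of the samples kept after rejecting the most uncertain ones.

    For each fraction f, the f most uncertain samples are dropped and the
    mean error of the remainder is reported. Illustrative sketch only.
    """
    errors = np.asarray(errors, dtype=float)
    order = np.argsort(uncertainties)      # most certain samples first
    sorted_errors = errors[order]
    n = len(sorted_errors)
    curve = []
    for f in fractions:
        keep = max(1, int(round(n * (1.0 - f))))  # always keep at least one
        curve.append(sorted_errors[:keep].mean())
    return np.array(curve)


# Toy data in which uncertainty ranks samples in the same order as error.
toy_errors = [1.0, 2.0, 3.0, 4.0]
toy_uncertainties = [0.1, 0.2, 0.3, 0.4]
curve = rejection_curve(toy_errors, toy_uncertainties, fractions=[0.0, 0.5])
print(curve)
```

If the uncertainty score is informative, the curve decreases as the rejected fraction grows; an uninformative score would leave it roughly flat.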