Medical diffusion on a budget: Textual Inversion for medical image generation

Bram de Wilde; Anindo Saha; Maarten de Rooij; Henkjan Huisman; Geert Litjens

TL;DR

This work shows that Textual Inversion (TI) can adapt pre-trained diffusion models such as Stable Diffusion to medical imaging using only 100 examples and a consumer GPU, addressing data scarcity and privacy concerns. By training compact embeddings for a medical concept, the approach generates diagnostically meaningful images across modalities, supports compositional prompting and inpainting, and can augment limited real data for downstream classification (for example, adding synthetic data raises prostate MRI AUC from 0.78 to 0.80). In a blinded radiologist evaluation, diffusion with TI yields better perceptual realism than StyleGAN3 baselines; the comparison also highlights the limitations of conventional metrics such as FID/MFID in medical contexts. The results suggest a practical, low-resource way to generate synthetic medical images for training and scenario planning, with embeddings that are easy to share and integrate into existing workflows. The approach is not a substitute for large-scale, captioned medical datasets.
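The training pattern behind Textual Inversion (every pre-trained weight stays frozen and only the new token's embedding is updated) can be illustrated with a deliberately tiny stand-in. The sketch below is not the paper's implementation: the frozen "model" is a hypothetical 2x3 linear map, the loss is a plain squared error rather than the diffusion denoising loss, and all names and values (`FROZEN_W`, `TARGET`, the learning rate) are assumptions chosen for illustration.

```python
# Toy stand-in for Textual Inversion: the "model" is frozen, and gradient
# descent only ever touches the embedding vector. All values are illustrative.

# Frozen weights stored as tuples, so they cannot be modified in place.
FROZEN_W = ((1.0, 0.0, 2.0),
            (0.0, 1.0, -1.0))
TARGET = (1.0, -2.0)  # hypothetical training signal


def predict(embedding):
    """Apply the frozen linear map to the embedding."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in FROZEN_W]


def loss(embedding):
    """Squared error between the frozen model's output and the target."""
    return sum((p - t) ** 2 for p, t in zip(predict(embedding), TARGET))


def grad(embedding):
    """Gradient of the loss with respect to the embedding only."""
    residual = [p - t for p, t in zip(predict(embedding), TARGET)]
    return [
        2.0 * sum(residual[i] * FROZEN_W[i][j] for i in range(len(FROZEN_W)))
        for j in range(len(embedding))
    ]


def train(embedding, lr=0.05, steps=200):
    """Plain gradient descent; returns a new, updated embedding."""
    current = list(embedding)
    for _ in range(steps):
        current = [x - lr * g for x, g in zip(current, grad(current))]
    return current


initial = [0.0, 0.0, 0.0]
learned = train(initial)
print("loss before:", loss(initial))
print("loss after: ", loss(learned))
```

In an actual Textual Inversion run, the frozen part would be the full text encoder and denoising network, and the trainable part would be the embedding vectors of the new token.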

Abstract

Diffusion models for text-to-image generation, known for their efficiency, accessibility, and quality, have gained popularity. While inference with these systems on consumer-grade GPUs is increasingly feasible, training from scratch requires large captioned datasets and significant computational resources. In medical image generation, the limited availability of large, publicly accessible datasets with text reports poses challenges due to legal and ethical concerns. This work shows that adapting pre-trained Stable Diffusion models to medical imaging modalities is achievable by training text embeddings using Textual Inversion. In this study, we experimented with small medical datasets (100 samples each from three modalities) and trained within hours to generate diagnostically accurate images, as judged by an expert radiologist. Experiments with Textual Inversion training and inference parameters reveal the necessity of larger embeddings and more examples in the medical domain. Classification experiments show an increase in diagnostic accuracy (AUC) for detecting prostate cancer on MRI, from 0.78 to 0.80. Further experiments demonstrate embedding flexibility through disease interpolation, combining pathologies, and inpainting for precise disease appearance control. The trained embeddings are compact (less than 1 MB), enabling easy data sharing with reduced privacy concerns.
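The "less than 1 MB" figure is plausible from simple arithmetic. The sketch below assumes that an embedding size of 64 (the value reported in the Figure 5 caption) means 64 vectors per token, that values are stored as 32-bit floats, and that each vector has 768 or 1024 dimensions, the text-encoder widths of Stable Diffusion 1.x and 2.x respectively. None of these storage details are stated in the text above, so treat the numbers as an estimate.

```python
def embedding_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for a learned embedding, ignoring file-format overhead."""
    return num_vectors * dim * bytes_per_value


# Assumed text-encoder widths: 768 (Stable Diffusion 1.x), 1024 (2.x).
for dim in (768, 1024):
    size = embedding_size_bytes(num_vectors=64, dim=dim)
    print(f"64 vectors x {dim} dims: {size} bytes ({size / 1024:.0f} KiB)")
```

Under these assumptions the raw embedding stays well below 1 MB, which is several orders of magnitude smaller than a full model checkpoint.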

Paper Structure (22 sections, 10 figures, 2 tables)

Introduction
Related work
Methods
Image generation
Classification
Datasets
Multi-modal MRI - PI-CAI
Chest X-ray - CheXpert
Histopathology - PatchCamelyon
Experiments
Adapting TI parameters to medical imaging
Comparison to StyleGAN3
Classification with synthetic data
Composability of embeddings
Controlling disease appearance with inpainting
...and 7 more sections

Figures (10)

Figure 1: The Textual Inversion fine-tuning process for diffusion models trains a text conditioning embedding for a new token using a small set of images while keeping the rest of the architecture frozen. We show that this allows the adaptation of latent diffusion models to a variety of medical imaging modalities, using only 100 examples and a single consumer-grade GPU.
Figure 2: Interpolation between a healthy and diseased state for multi-modal Prostate MRI. The column titles show the trade-off between healthy and diseased.
Figure 3: Visual example illustrating that multiple embeddings can be composed to show multiple pathologies in a single image. From left to right, pleural effusion, pneumonia, and cardiomegaly are progressively added to a healthy generated example.
Figure 4: Inpainting of prostate cancer in different locations on the same healthy generated Prostate MRI example. The top row shows the original healthy case, with the bottom rows showing inpainting in different locations with varying mask sizes.
Figure 5: Visual examples illustrating the effect of varying inference and training settings for T2-weighted prostate MRI, all generated using the same random seed. Columns with a bold title indicate optimal values. Row labels indicate the parameter that changes along the column, with bold values set for the other parameters. For example, in the top row, the number of steps changes, but the CFG scale, embedding size, and training cases are 2, 64, and 100, respectively.
...and 5 more figures
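The healthy-to-diseased trade-off described for Figure 2 can be pictured as a weighted blend of two learned embeddings. The captions above do not say how the interpolation is implemented, so the linear blend below is an assumption, and `interpolate`, `healthy`, and `diseased` are hypothetical names with toy values.

```python
def interpolate(healthy, diseased, alpha):
    """Linear blend: alpha=0 gives the healthy embedding, alpha=1 the diseased one."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(healthy) != len(diseased):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - alpha) * h + alpha * d for h, d in zip(healthy, diseased)]


# Toy two-dimensional embeddings; real ones have hundreds of dimensions.
healthy = [0.0, 2.0]
diseased = [1.0, 4.0]

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(alpha, interpolate(healthy, diseased, alpha))
```

Each blended vector would then be passed to the frozen diffusion model as conditioning, ideally with a fixed random seed so that only the disease weighting changes between images.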
