DiNO-Diffusion. Scaling Medical Diffusion via Self-Supervised Pre-Training

Guillermo Jimenez-Perez; Pedro Osorio; Josef Cersovsky; Javier Montalt-Tordera; Jens Hooge; Steffen Vogler; Sadegh Mohammadi

TL;DR

This work addresses the data annotation bottleneck in medical diffusion models by introducing DiNO-Diffusion, which conditions latent diffusion models on DiNO-derived image embeddings instead of text. Trained on a large, unlabeled chest X-ray corpus, the approach demonstrates robust image generation quality, effective data augmentation (up to ~20% AUC gains in small-data scenarios), and the viability of full synthetic training for privacy-preserving data sharing. It also achieves zero-shot segmentation with high Dice scores (up to 84.4%), illustrating strong anatomical alignment without task-specific labels. The method is architecture-agnostic and extendable to other modalities, paving the way for large-scale, multi-domain medical image generation alongside downstream AI model training with limited real data.

Abstract

Diffusion models (DMs) have emerged as powerful foundation models for a variety of tasks, with a strong focus on synthetic image generation. However, their requirement of large annotated datasets for training limits their applicability in medical imaging, where datasets are typically smaller and sparsely annotated. We introduce DiNO-Diffusion, a self-supervised method for training latent diffusion models (LDMs) that conditions the generation process on image embeddings extracted from DiNO. By eliminating the reliance on annotations, our training leverages over 868k unlabelled images from public chest X-Ray (CXR) datasets. Despite being self-supervised, DiNO-Diffusion shows comprehensive manifold coverage, with FID scores as low as 4.7, and emerging properties when evaluated in downstream tasks. It can be used to generate semantically diverse synthetic datasets even from small data pools, demonstrating up to a 20% AUC increase in classification performance when used for data augmentation. Images were generated using real images as a starting point and with different sampling strategies over the DiNO embedding manifold. The results suggest that DiNO-Diffusion could facilitate the creation of large datasets for flexible training of downstream AI models from a limited amount of real data, while also holding potential for privacy preservation. Additionally, DiNO-Diffusion demonstrates zero-shot segmentation performance of up to 84.4% Dice score when evaluated on lung lobe segmentation. This evidences good CXR image-anatomy alignment, akin to segmenting using textual descriptors on vanilla DMs. Finally, DiNO-Diffusion can be easily adapted to other medical imaging modalities or state-of-the-art diffusion models, opening the door for large-scale, multi-domain image generation pipelines for medical imaging.

Paper Structure (20 sections, 1 equation, 5 figures, 2 tables)

Introduction
Methods
Data
Generative Architecture - Stable Diffusion
Self-Supervised Conditioning
Reconstruction-based image generation
Interpolation-based image generation
Evaluation
Image Quality & Checkpoint Selection
Data Augmentation
Full Synthetic Training
Zero-Shot Segmentation
Experimental Setup
Results
Image Quality & Checkpoint Selection
...and 5 more sections

Figures (5)

Figure 1: Training and evaluation protocol. (a) DiNO-Diffusion training pipeline: the training image is both embedded into latents $z_0$ with a frozen VAE and processed by a frozen image encoder to generate global tokens that act as the condition $c_{GLB}$. Then, the latents are noised to $z_t$ at timestep $t$ and fed along with the condition to the UNet, which produces the denoised latent $\hat{z}_0$. Finally, the loss $\textrm{L}_{LDM}$ is computed between $z_0$ and $\hat{z}_0$. (b) Evaluation protocols: the trained UNet is used to produce: (b-i) "reconstructions" of a given image; (b-ii) "interpolated" synthetic images from the embeddings of a source ($c_s$) and a target ($c_t$) real image at interpolation fraction $r$; or (b-iii) segmentation masks, by iteratively merging latent attention maps.
Figure 2: Examples of generated images with DiNO-Diffusion. In the reconstruction experiment (a), each row represents randomly generated examples from two base images within MIMIC and for both DiNOv1-Diffusion and DiNOv2-Diffusion, showing semantic variability. In the interpolation experiment (b), each row depicts two real images and the result from generating synthetic images by interpolating the embeddings incrementally for the DiNOv1-Diffusion (b-top) and DiNOv2-Diffusion (b-bottom) settings.
Figure 3: FID scores for both DiNO-Diffusion models, computed every 2500 steps over a subset of MIMIC. Lower is better.
Figure 4: Boxplots for the Data Augmentation (a) and the Full Synthetic Training (b) experiments, representing performance improvement when adding synthetic data in different data regimes relative to using real data only. The horizontal dotted line represents a 0% improvement over the mean (red dot) classification performance when using real data only (green bars) for each data regime and Real-to-Synthetic ($rs$) ratio independently. Therefore, values above the dotted line represent performance improvement and values below it represent performance degradation. The vertical lines separate the different data regimes for easier comparison, where the performance of DiNOv1-Diffusion (yellow palette) and DiNOv2-Diffusion (blue palette) are jointly displayed. Panels (i) show the results for the reconstruction experiment, whereas panels (ii) show the results for the interpolation experiment. Asterisks (*) represent statistical significance relative to the real-data baseline ($p < 0.05$).
Figure 5: (a) Example segmentation masks generated by the best DiNOv1-Diffusion model and (b) common failure cases. Failures are caused by sub-optimal hyperparameters: (1) incomplete segmentation, often observed in early checkpoints or high thresholds; (2) oversegmentation and fragmentation, usually due to low merge thresholds; (3) bubble-like artifacts, mostly observed in later checkpoints.
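
As a rough illustration of the interpolation protocol in Figure 1 (b-ii), the sketch below blends a source embedding $c_s$ and a target embedding $c_t$ at an interpolation fraction $r$. It assumes plain linear interpolation, which this page does not confirm; the function names `interpolate_embeddings` and `interpolation_path` are hypothetical, and the embeddings are represented as short lists of floats rather than real DiNO outputs.

```python
from typing import List, Sequence


def interpolate_embeddings(
    c_s: Sequence[float], c_t: Sequence[float], r: float
) -> List[float]:
    """Blend a source and a target embedding at fraction r in [0, 1].

    Assumption: linear interpolation. r = 0 returns the source
    embedding and r = 1 returns the target embedding.
    """
    if len(c_s) != len(c_t):
        raise ValueError("source and target embeddings must have equal length")
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return [(1.0 - r) * s + r * t for s, t in zip(c_s, c_t)]


def interpolation_path(
    c_s: Sequence[float], c_t: Sequence[float], steps: int
) -> List[List[float]]:
    """Return `steps` evenly spaced conditions from c_s to c_t, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        interpolate_embeddings(c_s, c_t, i / (steps - 1)) for i in range(steps)
    ]


# Toy 2-dimensional embeddings standing in for real image embeddings.
path = interpolation_path([0.0, 2.0], [4.0, 6.0], steps=3)
```

Each entry of `path` would then be passed to the trained UNet as the condition in place of a single image's embedding, yielding one synthetic image per interpolation fraction.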
