MedDIFT: Multi-Scale Diffusion-Based Correspondence in 3D Medical Imaging

Xingyu Zhang, Anna Reithmeir, Fryderyk Kögl, Rickmer Braren, Julia A. Schnabel, Daniel M. Lang

TL;DR

The paper tackles 3D medical image correspondence by moving beyond local intensity-based registration. It introduces MedDIFT, a training-free pipeline that extracts multi-scale diffusion features from a pretrained 3D medical diffusion model (MAISI), fuses them into voxel descriptors, and matches voxels via cosine similarity (optionally using a local search). Results on lung CT demonstrate competitive accuracy with state-of-the-art learning-based methods and clear superiority over conventional B-spline registration, with ablations confirming the benefits of multi-scale fusion and moderate diffusion noise. The work suggests that diffusion-based semantic representations can effectively guide robust, lightweight voxel correspondence in 3D medical imaging, with potential extensions to multimodal tasks and integration into broader registration pipelines.

Abstract

Accurate spatial correspondence between medical images is essential for longitudinal analysis, lesion tracking, and image-guided interventions. Medical image registration methods rely on local intensity-based similarity measures, which fail to capture global semantic structure and often yield mismatches in low-contrast or anatomically variable regions. Recent advances in diffusion models suggest that their intermediate representations encode rich geometric and semantic information. We present MedDIFT, a training-free 3D correspondence framework that leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors. MedDIFT fuses diffusion activations into rich voxel-wise descriptors and matches them via cosine similarity, with an optional local-search prior. On a publicly available lung CT dataset, MedDIFT achieves correspondence accuracy comparable to the state-of-the-art learning-based UniGradICON model and surpasses conventional B-spline-based registration, without requiring any task-specific model training. Ablation experiments confirm that multi-level feature fusion and modest diffusion noise improve performance.
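As a rough illustration of the fuse-then-match step described in the abstract, the following NumPy sketch upsamples feature maps from several levels to a common grid, concatenates them into unit-norm voxel descriptors, and picks the target voxel with the highest cosine similarity, optionally restricted to a cubic window around the source location. The function names (`unit_norm`, `fuse_levels`, `match_voxel`), the nearest-neighbour upsampling, the per-level normalisation, and the `radius` parameter are assumptions made for illustration. The feature maps are random placeholders rather than MAISI activations, and the paper's actual implementation may differ.

```python
import numpy as np

def unit_norm(desc, eps=1e-8):
    """L2-normalise a (C, D, H, W) descriptor volume along the channel axis."""
    norm = np.linalg.norm(desc, axis=0, keepdims=True)
    return desc / np.maximum(norm, eps)

def fuse_levels(feature_maps, out_shape):
    """Upsample each (C_l, d, h, w) map to out_shape and concatenate channels.

    Assumption: nearest-neighbour upsampling and per-level normalisation are
    simplifications chosen for this sketch, not the paper's stated procedure.
    """
    fused = []
    for fmap in feature_maps:
        channels = np.arange(fmap.shape[0])
        # Nearest-neighbour source index for every output coordinate, per axis.
        idx = [np.arange(o) * s // o for o, s in zip(out_shape, fmap.shape[1:])]
        upsampled = fmap[np.ix_(channels, *idx)]
        fused.append(unit_norm(upsampled))
    return unit_norm(np.concatenate(fused, axis=0))

def match_voxel(src_desc, tgt_desc, src_voxel, radius=None):
    """Return the target voxel whose descriptor is most similar to the source's.

    Both volumes are assumed unit-normalised, so the dot product is the cosine
    similarity. If radius is given, the search is limited to a cube of
    half-width radius centred on src_voxel (a hypothetical local-search prior).
    """
    query = src_desc[(slice(None),) + tuple(src_voxel)]
    sim = np.einsum("c,cdhw->dhw", query, tgt_desc)
    if radius is not None:
        window = tuple(
            slice(max(0, v - radius), min(s, v + radius + 1))
            for v, s in zip(src_voxel, sim.shape)
        )
        restricted = np.full(sim.shape, -np.inf)
        restricted[window] = sim[window]
        sim = restricted
    best = np.unravel_index(np.argmax(sim), sim.shape)
    return tuple(int(i) for i in best)

# Toy usage: the "source" is a shifted copy of the target descriptors.
rng = np.random.default_rng(0)
levels = [rng.normal(size=(4, 3, 3, 3)), rng.normal(size=(2, 6, 6, 6))]
target = fuse_levels(levels, out_shape=(6, 6, 6))
source = np.roll(target, shift=(1, 0, -1), axis=(1, 2, 3))
print(match_voxel(source, target, (3, 3, 3), radius=2))
```

Restricting the search window trades recall for robustness: it excludes distant voxels with similar appearance, but it can only find the true match if the displacement is smaller than the chosen radius.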

Paper Structure

This paper contains 10 sections, 2 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1.1: MedDIFT derives diffusion features from the pretrained MAISI model [guo2025maisi] and fuses them into semantic feature descriptors.
  • Figure 1.2: Heatmap of mean keypoint error (in mm) across different decoder levels ($l$) and diffusion timesteps ($t$). Fusion of feature maps from multiple levels is denoted by concatenated level indices; for instance, $012$ means that levels $0$, $1$, and $2$ are combined.
  • Figure 1.3: Qualitative example. The projection onto the coronal slice of an example source keypoint (green) and the estimated (red) and ground truth correspondences (blue) are shown, together with the similarity score map for the respective slice.
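
The heatmap in Figure 1.2 reports mean keypoint error in millimetres. A minimal sketch of how such a metric could be computed from predicted and ground-truth voxel coordinates is given below. It assumes the error is the Euclidean distance in physical space averaged over keypoints; the function name and the `spacing_mm` argument are hypothetical, and the paper's exact evaluation protocol is not shown in this section.

```python
import numpy as np

def mean_keypoint_error_mm(pred_vox, gt_vox, spacing_mm):
    """Mean Euclidean distance (mm) between predicted and true keypoints.

    pred_vox, gt_vox: (N, 3) voxel coordinates.
    spacing_mm: voxel size along each axis in millimetres (assumed known).
    """
    pred = np.asarray(pred_vox, dtype=float)
    gt = np.asarray(gt_vox, dtype=float)
    # Convert voxel offsets to physical offsets before measuring distance.
    diff_mm = (pred - gt) * np.asarray(spacing_mm, dtype=float)
    return float(np.linalg.norm(diff_mm, axis=1).mean())

# Two keypoints on an isotropic 1 mm grid: errors of 0 mm and 5 mm.
pred = [[0, 0, 0], [3, 4, 0]]
gt = [[0, 0, 0], [0, 0, 0]]
print(mean_keypoint_error_mm(pred, gt, spacing_mm=(1.0, 1.0, 1.0)))  # 2.5
```

Scaling by the voxel spacing matters for anisotropic CT volumes, where a one-voxel offset along the slice axis can correspond to a larger physical distance than the same offset in-plane.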