General Vision Encoder Features as Guidance in Medical Image Registration

Fryderyk Kögl, Anna Reithmeir, Vasiliki Sideri-Lampretsa, Ines Machado, Rickmer Braren, Daniel Rückert, Julia A. Schnabel, Veronika A. Zimmer

TL;DR

The paper investigates whether features from general vision encoders can serve as dissimilarity measures to guide deformable medical image registration. It benchmarks DINOv2, SAM, and MedSAM within a B-spline free-form deformation (FFD) framework and tests two integration variants on a cardiac cine MRI dataset. The main finding is that adding a feature-based distance as an auxiliary term improves registration quality, with MedSAM excelling in segmentation overlap and DINOv2 performing robustly, especially when inputs are upscaled. The results suggest that task-agnostic vision encoders can enhance geometry-based registration without retraining on medical data; the paper points to 3D volumes and intermediate encoder layers as future work.

Abstract

General vision encoders like DINOv2 and SAM have recently transformed computer vision. Even though they are trained on natural images, such encoder models have excelled in medical imaging, e.g., in classification, segmentation, and registration. However, no in-depth comparison of different state-of-the-art general vision encoders for medical registration is available. In this work, we investigate how well general vision encoder features can be used in the dissimilarity metrics for medical image registration. We explore two encoders that were trained on natural images as well as one that was fine-tuned on medical data. We apply the features within the well-established B-spline FFD registration framework. In extensive experiments on cardiac cine MRI data, we find that using features as additional guidance for conventional metrics improves the registration quality. The code is available at github.com/compai-lab/2024-miccai-koegl.
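
As a rough illustration of what "features in the dissimilarity metric" can mean, the sketch below computes a cosine and an L1 dissimilarity between two sets of patch feature vectors and adds the result to a conventional image term. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the `weight` parameter, and the plain weighted sum are assumptions made for illustration, and a practical version would operate on encoder outputs with an array library.

```python
import math


def cosine_dissimilarity(feats_a, feats_b, eps=1e-8):
    """Mean of (1 - cosine similarity) over corresponding feature vectors."""
    total = 0.0
    for a, b in zip(feats_a, feats_b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        # eps guards against division by zero for all-zero vectors
        total += 1.0 - dot / (norm_a * norm_b + eps)
    return total / len(feats_a)


def l1_dissimilarity(feats_a, feats_b):
    """Mean absolute difference over all feature entries."""
    total = 0.0
    count = 0
    for a, b in zip(feats_a, feats_b):
        for x, y in zip(a, b):
            total += abs(x - y)
            count += 1
    return total / count


def guided_objective(image_term, feature_term, weight=1.0):
    """Hypothetical combination: conventional metric plus weighted feature term."""
    return image_term + weight * feature_term


# Toy features: two patches, two channels each (values are made up).
fixed_feats = [[1.0, 0.0], [0.0, 2.0]]
warped_feats = [[0.0, 1.0], [3.0, 0.0]]

feature_term = cosine_dissimilarity(fixed_feats, warped_feats)
loss = guided_objective(image_term=0.5, feature_term=feature_term, weight=0.1)
```

Cosine dissimilarity ignores the magnitude of the feature vectors and only compares their direction, whereas L1 is sensitive to both.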

Paper Structure

This paper contains 18 sections, 3 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Overall framework: We explore how different pre-trained natural image encoders can be used within the registration objective function. The fixed image and the warped moving image are encoded with the frozen encoders, and the distance between the features is measured. This distance term is then used in the objective function of the iterative B-spline registration.
  • Figure 2: Qualitative comparison of the first 5 principal components of DINOv2, SAM, and MedSAM features for a test image. DINOv2 captures coarser features, while SAM focuses on edges and MedSAM on the underlying texture.
  • Figure 3: Comparison of the L1 and cosine dissimilarity measures on the features of the three encoder models for rotation and translation of one test image. For both rotation and translation, the cosine dissimilarity curves are smoother and have a wider capture range, making cosine dissimilarity the better choice of dissimilarity measure.
  • Figure 4: Quantitative comparison: Class-wise and mean DSC scores before and after registration. Methods using the same encoder share the same color with varying hues. The boxplots are grouped horizontally by class: right ventricular cavity (RV), left ventricular myocardium (LV-Myo), left ventricular cavity (LV), and the mean over all classes. We see that variant 2 outperforms the baseline in both class-wise and mean DSC.
  • Figure 5: Qualitative comparison of the Jacobian determinant maps from variant 1 with the DINOv2 and SAM encoders, as well as the baseline. From left to right: the Jacobian determinant map for DINOv2 shows many local deformations in the background regions, the map for SAM shows fewer, and the baseline transformation appears smoother outside of the heart. The fixed and moving images differ only in the heart, so we would expect most of the deformation to occur there. White areas show shrinkage and blue areas show expansion.
  • ...and 11 more figures
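
The two evaluation quantities named in the captions above, the Dice similarity coefficient (DSC, Figure 4) and the Jacobian determinant of the transformation (Figure 5), can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions: segmentations are flat lists of integer labels, the 2D displacement field is given as two nested lists in pixel units, and central finite differences are used on interior pixels only. The function names are hypothetical and are not taken from the paper's code.

```python
def dice_score(seg_a, seg_b, label):
    """DSC = 2|A and B| / (|A| + |B|) for one label in two flat label lists."""
    in_a = [v == label for v in seg_a]
    in_b = [v == label for v in seg_b]
    overlap = sum(1 for a, b in zip(in_a, in_b) if a and b)
    size = sum(in_a) + sum(in_b)
    # Convention chosen here: a label absent from both counts as a perfect match.
    return 2.0 * overlap / size if size else 1.0


def jacobian_determinant(disp_y, disp_x):
    """Jacobian determinant of phi(p) = p + u(p) on interior pixels of a 2D grid.

    disp_y and disp_x hold the displacement components along rows and columns.
    Returns a nested list of shape (H - 2) x (W - 2).
    """
    h, w = len(disp_y), len(disp_y[0])
    det = []
    for i in range(1, h - 1):
        row = []
        for j in range(1, w - 1):
            # Central differences of each displacement component
            duy_dy = (disp_y[i + 1][j] - disp_y[i - 1][j]) / 2.0
            duy_dx = (disp_y[i][j + 1] - disp_y[i][j - 1]) / 2.0
            dux_dy = (disp_x[i + 1][j] - disp_x[i - 1][j]) / 2.0
            dux_dx = (disp_x[i][j + 1] - disp_x[i][j - 1]) / 2.0
            # det(I + grad u) for a 2x2 Jacobian
            row.append((1.0 + duy_dy) * (1.0 + dux_dx) - duy_dx * dux_dy)
        det.append(row)
    return det


# Toy example: uniform 10% stretch on a 4x4 grid (values are made up).
stretch_y = [[0.1 * i for j in range(4)] for i in range(4)]
stretch_x = [[0.1 * j for j in range(4)] for i in range(4)]
det_map = jacobian_determinant(stretch_y, stretch_x)
```

A determinant above 1 indicates local expansion, a value between 0 and 1 indicates local shrinkage, and a value at or below 0 indicates folding of the transformation.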