Mamba? Catch The Hype Or Rethink What Really Helps for Image Registration

Bailiang Jian, Jiazhen Pan, Morteza Ghahremani, Daniel Rueckert, Christian Wachinger, Benedikt Wiestler

TL;DR

The paper investigates whether adopting 'advanced' low-level blocks (e.g., Vision Transformers, Mamba, large-kernel CNNs) genuinely improves brain MRI deformable registration. Through a modular component analysis built on a Voxelmorph baseline, it contrasts low-level replacements with high-level registration-specific designs such as coarse-to-fine motion pyramids, dual-stream encoders, correlation layers, and iterative optimization. The results reveal that advanced blocks offer little or no improvement, while high-level designs yield modest gains (about $1.5\%$ Dice, up to $5\%$ in zero-shot LPBA), with Voxelmorph remaining highly competitive. The study argues for simpler, registration-aware architectures and novel evaluation metrics, releasing code to enable broader, fair comparisons across datasets and modalities.

Abstract

Our findings indicate that adopting "advanced" computational elements fails to significantly improve registration accuracy. Instead, well-established registration-specific designs offer fair improvements, enhancing results by a marginal 1.5% over the baseline. Our findings emphasize the importance of rigorous, unbiased evaluation and contribution disentanglement of all low- and high-level registration components, rather than simply following the computer vision trends with "more advanced" computational blocks. We advocate for simpler yet effective solutions and novel evaluation metrics that go beyond conventional registration accuracy, warranting further research across diverse organs and modalities. The code is available at https://github.com/BailiangJ/rethink-reg.

Paper Structure (23 sections, 2 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: Overview of the baseline and the modularized components. (a) Upper left: the U-Net-based baseline method, which concatenates the source and target images and directly predicts the final deformation field. (b) Bottom left: the dual-stream encoder variant. (c) Right: the detailed workflow of each registration-specific element (Pyramid, Warping, Correlation, and Iteration) at a given level $\ell$, where level $\ell$ corresponds to $2^{-\ell}$ resolution. The widths of the cubes in the right diagram are chosen for display convenience and do not reflect the exact number of feature channels.
  • Figure 2: Sagittal view of the results for a randomly sampled pair from the LPBA dataset. The first column shows, from top to bottom, the target label map, the target image, the source image, and the source label map. The remaining columns correspond to the compared methods. The first and second rows show the warped source (moved) label map and image produced by each method. The number at the bottom right of each moved label map is the mean Dice score (DSC) over the volume (not the slice). The third row shows the subtraction map (error map) between the target and moved images; its values lie in [-1,1] because the image intensities are normalized to [0,1]. The last row shows the warped grid of the deformation field. The two numbers shown are the mean foreground displacement magnitude and the percentage of foreground non-diffeomorphic voxels; both statistics are computed over the entire image volume.
  • Figure 3: Axial view of the results for a randomly sampled pair from the Mindboggle dataset. The first column shows, from top to bottom, the target label map, the target image, the source image, and the source label map. The remaining columns correspond to the compared methods. The first and second rows show the warped source (moved) label map and image produced by each method. The number at the bottom right of each moved label map is the mean Dice score (DSC) over the volume (not the slice). The third row shows the subtraction map (error map) between the target and moved images; its values lie in [-1,1] because the image intensities are normalized to [0,1]. The last row shows the warped grid of the deformation field. The two numbers shown are the mean foreground displacement magnitude and the percentage of foreground non-diffeomorphic voxels; both statistics are computed over the entire image volume.
  • Figure 4: Figure S1: Sagittal view of the results for a randomly sampled pair from the OASIS dataset. The first column shows, from top to bottom, the target label map, the target image, the source image, and the source label map. The remaining columns correspond to the compared methods. The first and second rows show the warped source (moved) label map and image produced by each method. The number at the bottom right of each moved label map is the mean Dice score (DSC) over the volume (not the slice). The third row shows the subtraction map (error map) between the target and the warped source image; its values lie in [-1,1] because the image intensities are normalized to [0,1]. The last row shows the warped grid of the deformation field. The two numbers shown are the mean foreground displacement magnitude and the percentage of foreground non-diffeomorphic voxels; both statistics are computed over the entire image volume.
  • Figure 5: Figure S2: Axial view of the results for a randomly sampled pair from the ADNI dataset. The first column shows, from top to bottom, the target label map, the target image, the source image, and the source label map. The remaining columns correspond to the compared methods. The first and second rows show the warped source (moved) label map and image produced by each method. The number at the bottom right of each moved label map is the mean Dice score (DSC) over the volume (not the slice). The third row shows the subtraction map (error map) between the target and the warped source image; its values lie in [-1,1] because the image intensities are normalized to [0,1]. The last row shows the warped grid of the deformation field. The two numbers shown are the mean foreground displacement magnitude and the percentage of foreground non-diffeomorphic voxels; both statistics are computed over the entire image volume.
  • ...and 1 more figures
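
The figure captions above report three per-volume statistics: the mean Dice score (DSC), the mean foreground displacement magnitude, and the percentage of foreground non-diffeomorphic voxels. The following is a minimal NumPy sketch of how such statistics could be computed, not the authors' implementation. The function names, the `(3, D, H, W)` displacement layout in voxel units, the use of central differences, and the criterion "Jacobian determinant ≤ 0" for a non-diffeomorphic voxel are all assumptions made for illustration; the released code may define these quantities differently.

```python
import numpy as np


def mean_dice(moved_labels, target_labels):
    """Mean Dice over all non-zero labels present in the target label map."""
    labels = [l for l in np.unique(target_labels) if l != 0]
    if not labels:
        return float("nan")  # no foreground structure to score
    scores = []
    for l in labels:
        a = moved_labels == l
        b = target_labels == l
        # b is non-empty by construction, so the denominator is positive
        scores.append(2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum()))
    return float(np.mean(scores))


def mean_displacement(disp, mask):
    """Mean Euclidean displacement magnitude inside a boolean foreground mask.

    disp: array of shape (3, D, H, W), displacement in voxel units (assumed layout).
    """
    return float(np.linalg.norm(disp, axis=0)[mask].mean())


def folding_percentage(disp, mask):
    """Percentage of foreground voxels with non-positive Jacobian determinant.

    The transformation is phi(x) = x + u(x), so its Jacobian is I + grad(u).
    Gradients are approximated with np.gradient (an assumption of this sketch).
    """
    jac = np.empty(disp.shape[1:] + (3, 3))
    for i in range(3):
        grads = np.gradient(disp[i], axis=(0, 1, 2))  # d u_i / d x_j for j = 0, 1, 2
        for j in range(3):
            jac[..., i, j] = grads[j] + (1.0 if i == j else 0.0)
    det = np.linalg.det(jac)
    return 100.0 * np.count_nonzero(det[mask] <= 0) / np.count_nonzero(mask)


if __name__ == "__main__":
    target = np.zeros((4, 4, 4), dtype=int)
    target[1:3, 1:3, 1:3] = 1
    foreground = np.ones((4, 4, 4), dtype=bool)
    identity = np.zeros((3, 4, 4, 4))
    print(mean_dice(target, target))                 # identical label maps
    print(folding_percentage(identity, foreground))  # identity transform has no folding
```

Reporting the folding percentage alongside DSC matters because a network can raise overlap scores by producing physically implausible, folded deformations; the two numbers are best read together.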