EfficientMorph: Parameter-Efficient Transformer-Based Architecture for 3D Image Registration

Abu Zahid Bin Aziz, Mokshagna Sai Teja Karanam, Tushar Kataria, Shireen Y. Elhabian

TL;DR

EfficientMorph introduces a parameter-efficient transformer-based framework for unsupervised 3D image registration by integrating a plane-based attention mechanism and Hi-Resolution tokenization. The plane attention captures local and global context along coronal, sagittal, and axial planes, while token merging preserves high-resolution details without exponential growth in computation. A multi-resolution variant further enhances accuracy by fusing latent features from different patch scales without training multiple separate models. Across OASIS, Remind2Reg, and IXI, EfficientMorph achieves competitive or superior Dice scores with 16–27× fewer parameters and faster convergence, enabling practical deployment on resource-limited end-user devices.

Abstract

Transformers have emerged as the state-of-the-art architecture in medical image registration, outperforming convolutional neural networks (CNNs) by addressing their limited receptive fields and overcoming gradient instability in deeper models. Despite their success, transformer-based models require substantial resources for training, including data, memory, and computational power, which may restrict their applicability for end users with limited resources. In particular, existing transformer-based 3D image registration architectures face two critical gaps that challenge their efficiency and effectiveness. Firstly, although window-based attention mechanisms reduce the quadratic complexity of full attention by focusing on local regions, they often struggle to effectively integrate both local and global information. Secondly, the granularity of tokenization, a crucial factor in registration accuracy, presents a performance trade-off: smaller voxel-size tokens enhance detail capture but come with increased computational complexity, higher memory usage, and a greater risk of overfitting. We present EfficientMorph, a transformer-based architecture for unsupervised 3D image registration that balances local and global attention in 3D volumes through a plane-based attention mechanism and employs a Hi-Res tokenization strategy with merging operations, thus capturing finer details without compromising computational efficiency. Notably, EfficientMorph sets a new benchmark for performance on the OASIS dataset with 16–27× fewer parameters. Code: https://github.com/MedVIC-Lab/Efficient_Morph_Registration
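
The tokenization trade-off and the appeal of plane-based attention described in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the EfficientMorph code: the function names, the example volume shape, and the assumption that a plane-attention layer lets each token attend only to tokens in its own 2D plane are all hypothetical simplifications introduced here.

```python
from math import prod


def token_grid(volume_shape, stride):
    """Tokens per axis when non-overlapping patches of edge `stride` tile the volume."""
    return tuple(dim // stride for dim in volume_shape)


def full_attention_pairs(grid):
    """Query-key pairs when every token attends to every token in the volume."""
    n = prod(grid)
    return n * n


def plane_attention_pairs(grid, normal_axis):
    """Query-key pairs when attention is restricted to 2D planes.

    Assumption (not confirmed by the source): each token attends only to
    tokens that share its index along `normal_axis`, i.e. its own plane.
    """
    planes = grid[normal_axis]
    tokens_per_plane = prod(grid) // planes
    return planes * tokens_per_plane ** 2


if __name__ == "__main__":
    volume = (160, 192, 224)  # hypothetical volume shape, chosen for illustration
    for stride in (4, 2):
        grid = token_grid(volume, stride)
        print(
            f"stride {stride}: grid={grid}, tokens={prod(grid)}, "
            f"full pairs={full_attention_pairs(grid)}, "
            f"plane pairs (axis 0)={plane_attention_pairs(grid, 0)}"
        )
```

Under these assumptions, halving the stride multiplies the token count by 8 and the full-attention pair count by 64, while restricting attention to planes divides the pair count by the number of planes along the chosen axis.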

Paper Structure (25 sections, 3 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Parameter Count Comparisons with performance on OASIS Dataset. The proposed variants are formatted as EfficientMorph-11-stride-$C$ and EfficientMorph-23-stride-$C$. Comparison of parameter count in millions (M) and Dice scores between the proposed variants and baselines.
  • Figure 2: EfficientMorph Architecture. (A) EfficientMorph applies a plane attention mechanism to the whole volume, as shown in the Efficient Transformer Block. We use different numbers and types of plane attention ($xy$, $yz$, or $zx$ planes) for each block in the transformer backbone (Table \ref{tab:EfficientMorph_Variations}). Hi-Res Tokenization is shown at the left end of the figure. (B) Shows the architectural modification for the multi-resolution variant, where $S^{'}_{1} > S^{'}_{2}$.
  • Figure 3: Convergence Curves. The proposed variants are formatted as EfficientMorph-11-stride-$C$ and EfficientMorph-23-stride-$C$. Dice score curves of EfficientMorph variants as a function of epochs.
  • Figure 4: Impact of Annotated Segmentation Available for Training. These models were trained for the EM-23 variant with stride 4×4×4 and embedding dimension 96.
  • Figure 5: OASIS qualitative results. Comparison of the best, median, and worst outputs of TransMorph with those of the proposed variants. Here, EfficientMorph-23 and EfficientMorph-11 are variants with a stride of 2×2×2 and an embedding dimension of 96; CGA denotes variants with cascaded group attention.
  • ...and 4 more figures