Decoder-Only Image Registration

Xi Jia, Wenqi Lu, Xinxing Cheng, Jinming Duan

TL;DR

This work argues that encoder learning is often unnecessary for unsupervised 3D medical image registration and introduces LessNet, a decoder-only network that predicts dense displacement fields from image pairs using handcrafted multi-scale pooling features. By eliminating the learnable encoder and relying on a four-block decoder, LessNet achieves competitive Dice scores on two brain MRI datasets (OASIS-1 and IXI) while substantially reducing parameters, memory, and compute. It also supports diffeomorphic variants, obtained by exponentiating a velocity field with scaling and squaring. The results show that a compact, decoder-centric design can match state-of-the-art performance with much lower resource requirements, though the advantage of the diffeomorphic variants may be dataset-dependent. The paper highlights the potential of substituting learned encoders with handcrafted or pre-trained features to enable efficient, large-scale registration in practice.
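The scaling-and-squaring step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it works on a 1D grid in pure Python for brevity, and the helper names `interp1d` and `exponentiate`, the border clamping, and the default of 7 squaring steps are all assumptions made for illustration.

```python
def interp1d(field, x):
    """Linearly interpolate `field` (a list) at continuous index x, clamping at the borders."""
    n = len(field)
    x = min(max(x, 0.0), n - 1.0)
    i0 = int(x)
    i1 = min(i0 + 1, n - 1)
    w = x - i0
    return (1.0 - w) * field[i0] + w * field[i1]


def exponentiate(velocity, steps=7):
    """Approximate the displacement u of phi = id + u = exp(v) by scaling and squaring.

    `velocity` is a stationary velocity field in voxel units (assumed 1D here).
    """
    # Scaling: start from the small displacement v / 2**steps.
    disp = [v / (2 ** steps) for v in velocity]
    # Squaring: compose the map with itself, u <- u + u o (id + u), `steps` times.
    for _ in range(steps):
        disp = [u + interp1d(disp, i + u) for i, u in enumerate(disp)]
    return disp


if __name__ == "__main__":
    # A constant velocity integrates to a constant shift of the same size.
    print(exponentiate([0.5] * 16)[:3])
```

In a 3D setting the same recursion applies per voxel, with the linear interpolation replaced by trilinear resampling of the displacement field.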

Abstract

In unsupervised medical image registration, the predominant approaches use an encoder-decoder network architecture, allowing for precise prediction of dense, full-resolution displacement fields from given paired images. Despite its widespread use in the literature, we question the necessity of making both the encoder and decoder learnable in such an architecture. To this end, we propose a novel network architecture, termed LessNet in this paper, which contains only a learnable decoder and omits the learnable encoder entirely. LessNet substitutes the learnable encoder with simple, handcrafted features, eliminating the need to learn (optimize) network parameters in the encoder altogether. This leads to a compact, efficient, decoder-only architecture for 3D medical image registration. Evaluated on two publicly available brain MRI datasets, we demonstrate that our decoder-only LessNet can effectively and efficiently learn both dense displacement and diffeomorphic deformation fields in 3D. Furthermore, our decoder-only LessNet can achieve comparable registration performance to state-of-the-art methods such as VoxelMorph and TransMorph, while requiring significantly fewer computational resources. Our code and pre-trained models are available at https://github.com/xi-jia/LessNet.
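To make the idea of parameter-free, handcrafted features concrete, the sketch below builds a multi-scale pooling pyramid for an image pair. It is only an illustration under stated assumptions: it uses 2D lists instead of 3D volumes, plain average pooling with a factor of 2, and four levels. The function names `avg_pool2d` and `pooling_pyramid` are hypothetical and are not taken from the LessNet code.

```python
def avg_pool2d(img, k=2):
    """Average-pool a 2D list-of-lists image with a k x k window and stride k."""
    h, w = len(img) // k, len(img[0]) // k
    return [
        [
            sum(img[i * k + di][j * k + dj] for di in range(k) for dj in range(k)) / (k * k)
            for j in range(w)
        ]
        for i in range(h)
    ]


def pooling_pyramid(moving, fixed, levels=4):
    """Return one (moving, fixed) feature pair per scale, ordered fine to coarse.

    No parameters are learned here: every feature is a fixed function of the inputs.
    """
    feats = [(moving, fixed)]
    for _ in range(levels - 1):
        moving, fixed = avg_pool2d(moving), avg_pool2d(fixed)
        feats.append((moving, fixed))
    return feats


if __name__ == "__main__":
    moving = [[float(i * 8 + j) for j in range(8)] for i in range(8)]
    fixed = [[1.0] * 8 for _ in range(8)]
    for m, f in pooling_pyramid(moving, fixed):
        print(len(m), "x", len(m[0]))
```

A learnable decoder would then consume these pairs from coarse to fine, in place of the feature maps an encoder would normally produce.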

Paper Structure (20 sections, 4 equations, 5 figures, 7 tables)

This paper contains 20 sections, 4 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison of registration performance, training time, memory usage, and mult-add operations among different networks. In the main figure, the $y$-axis represents registration accuracy measured by Dice on the validation set, while the $x$-axis denotes GPU hours. Each network was trained for 500 epochs on an A100 GPU, and our LessNet reached 500 epochs in the fewest hours. In the subfigure, the $y$-axis also represents registration accuracy, while the $x$-axis denotes training memory usage measured in mebibytes (MiB). Within this subfigure, for each network, the area corresponds to the number of mult-adds required for one pair of images, each with a size of $160 \times 192 \times 224$. LessNet achieves superior registration accuracy with reduced training time, memory usage, and computational operations.
  • Figure 2: Semantic overview of different encoder-decoder architectures used in medical image registration. a) U-Net style network: In this architecture, input moving and fixed images are combined into a two-channel image (marked by black box) as input. The encoder (marked by blue box) and decoder (marked by red box) exhibit a symmetric layout, and features from the encoder are skip-connected to those in the decoder. b) Siamese-Net (Dual-Net): In this architecture, the input moving and fixed images are fed into two parallel encoders, separately. Next, the decoder combines these features from the two encoders and then maps them to a registration field. c) and d) Hybrid networks: In these architectures, some learnable layers are replaced by model-driven layers which are pre-defined, knowledge-driven parameter-free blocks. These hybrid approaches often lead to a reduced number of network parameters, fewer mult-adds, and thereby faster training and inference speeds. e) LessNet: Our network stands out by not having a learnable encoder at all. Instead, the decoder learns a full-resolution registration field directly from the input images.
  • Figure 3: The architecture of LessNet. The upper half panel demonstrates the generation of multi-scale pooling features, while the lower half panel showcases the input and output of the learnable decoder, which consists of four hierarchical convolutional blocks. The loss function is applied to the moving image, warped by the predicted displacement field, and the fixed image.
  • Figure 4: Qualitative comparison of registration performance. From top to bottom (apart from the 1st column) are warped images, displacement fields, deformation fields, and warped moving masks.
  • Figure 5: Comparison between 10 different methods on various metrics such as registration accuracy, GPU memory usage, etc. From left to right in each plot: VoxelMorph-1, VoxelMorph-2, Diff-VoxelMorph, TransMorph, Diff-TransMorph, Diff-B-Spline, Fourier-Net, Diff-Fourier-Net, LessNet$_4$, Diff-LessNet$_4$.
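Figures 1 and 5 report registration accuracy as Dice. As a reminder of what that number measures, here is a minimal sketch of the Dice overlap between a warped moving segmentation and the fixed segmentation for one anatomical label. The function name `dice`, the flat-list input, and the convention of returning 1.0 when a label is absent from both masks are assumptions for illustration, not the paper's evaluation code.

```python
def dice(warped_seg, fixed_seg, label):
    """Dice overlap 2|A n B| / (|A| + |B|) for one label over two flat label lists."""
    a = [x == label for x in warped_seg]
    b = [x == label for x in fixed_seg]
    inter = sum(1 for p, q in zip(a, b) if p and q)
    total = sum(a) + sum(b)
    # Assumed convention: a label missing from both masks counts as perfect overlap.
    return 2.0 * inter / total if total else 1.0


if __name__ == "__main__":
    warped = [0, 1, 1, 2, 2, 2]
    fixed = [0, 1, 2, 2, 2, 0]
    for label in (1, 2):
        print(label, dice(warped, fixed, label))
```

A reported score would typically be the average of this value over all labelled structures and all test pairs.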