Encoder-Only Image Registration

Xiang Chen, Renjiu Hu, Jinwei Zhang, Yuxi Zhang, Xinyao Yu, Min Liu, Yaonan Wang, Hang Zhang

TL;DR

EOIR is an encoder-only image registration framework that decouples feature learning from flow estimation to improve the accuracy-efficiency trade-off in large-deformation scenarios. Guided by the Horn–Schunck optical flow equation and a linearization-harmonization principle, EOIR uses a lightweight 3-layer encoder and multi-level Hadamard-based flow estimators within a Laplacian feature pyramid, composing deformation fields across levels to keep the transformation diffeomorphic. The method reports state-of-the-art accuracy-efficiency and accuracy-smoothness trade-offs across six diverse datasets while remaining scalable enough for large-scale deployment; it also shows strong zero-shot generalization and competitive performance on 2D multimodal tasks. Limitations include reduced multimodal performance with the lightweight encoder and difficulty with very small structures, which suggests future work on stronger encoders or priors. Overall, EOIR offers a memory-efficient backbone for diffeomorphic registration that can scale to large volumetric datasets and resource-constrained environments.

Abstract

Learning-based techniques have significantly improved the accuracy and speed of deformable image registration. However, challenges such as reducing computational complexity and handling large deformations persist. To address these challenges, we analyze how convolutional neural networks (ConvNets) influence registration performance using the Horn-Schunck optical flow equation. Supported by prior studies and our empirical experiments, we observe that ConvNets play two key roles in registration: linearizing local intensities and harmonizing global contrast variations. Based on these insights, we propose the Encoder-Only Image Registration (EOIR) framework, designed to achieve a better accuracy-efficiency trade-off. EOIR separates feature learning from flow estimation, employing only a 3-layer ConvNet for feature extraction and a set of 3-layer flow estimators to construct a Laplacian feature pyramid, progressively composing diffeomorphic deformations under a large-deformation model. Results on five datasets across different modalities and anatomical regions demonstrate EOIR's effectiveness, achieving superior accuracy-efficiency and accuracy-smoothness trade-offs. With comparable accuracy, EOIR provides better efficiency and smoothness, and vice versa. The source code of EOIR is publicly available at https://github.com/XiangChen1994/EOIR.
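
To make the "linearizing local intensities" observation concrete, the following is a minimal one-dimensional sketch, not the authors' code. It assumes the pointwise Horn-Schunck-style estimate u ≈ (I_m − I_f) / (dI_f/dx) and uses a Gaussian blur as a stand-in for a learned encoder. The helper names (`gaussian_blur`, `hs_displacement`), the step-edge signal, the shift of 2 samples, and the threshold `eps` are all illustrative assumptions.

```python
import math

def gaussian_blur(signal, sigma):
    """Blur a 1D list with a truncated, normalized Gaussian (edges replicated)."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)  # replicate edge samples
            acc += w * signal[j]
        out.append(acc)
    return out

def hs_displacement(fixed, moving, eps=1e-3):
    """Pointwise estimate u = (I_m - I_f) / (dI_f/dx).

    Returns None wherever the gradient of the fixed signal is (near) zero,
    i.e. where the linearized equation carries no information.
    """
    n = len(fixed)
    est = [None] * n
    for i in range(1, n - 1):
        grad = (fixed[i + 1] - fixed[i - 1]) / 2.0  # central difference
        if abs(grad) > eps:
            est[i] = (moving[i] - fixed[i]) / grad
    return est

# Hypothetical test signal: a step edge, and the same edge shifted by 2 samples,
# so that moving(x) = fixed(x + 2) and the true displacement is 2 everywhere.
shift = 2
fixed = [1.0 if i >= 32 else 0.0 for i in range(64)]
moving = [1.0 if i >= 32 - shift else 0.0 for i in range(64)]

raw_est = hs_displacement(fixed, moving)
blur_est = hs_displacement(gaussian_blur(fixed, 3.0), gaussian_blur(moving, 3.0))

raw_defined = sum(v is not None for v in raw_est)
blur_defined = sum(v is not None for v in blur_est)
print("samples with an estimate (raw, blurred):", raw_defined, blur_defined)
print("estimate at the edge (raw, blurred):", raw_est[31], blur_est[31])
```

On the raw step edge the estimate exists at only two samples, and one of them is wrong (0 instead of 2); after blurring, the estimate is defined across the whole smoothed edge and recovers the shift at its centre. This only illustrates why smoother, more locally linear features make the equation usable at more locations; it says nothing about how the trained encoder in EOIR behaves.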

Paper Structure

This paper contains 58 sections, 10 equations, 15 figures, 13 tables.

Figures (15)

  • Figure 1: Visual demonstration of local intensity linearization. The top row shows synthetic and real-world images, while the bottom row presents the corresponding heatmaps (values from 0 to 1) in the 'viridis' color map, where brighter areas indicate better linearization (heatmap generation is detailed in the appendix). The first three columns show synthetic examples: a binary square (values 0 and 1) and its Gaussian-blurred versions with $\sigma=1$ and $\sigma=3$. The last two columns display abdominal CT examples, with heatmaps derived from feature maps of untrained and trained ConvNets. Both Gaussian filtering and trained neural networks enhance local intensity linearization.
  • Figure 2: Architecture of the EOIR framework. The three-level pyramid operates as follows: (1) Features $I_m^{(l)}$ and $F_m^{(l)}$ are independently extracted (via encoder) and downsampled. (2) Deformation fields $\phi_1$–$\phi_3$ are estimated per level via flow estimators. (3) Deformations are composed across levels. This process breaks large deformations into a sequence of small, H–S-compliant residual steps, enabling robust registration. See the EOIR architecture section for details.
  • Figure 3: A one-dimensional analogue of the Horn–Schunck equation holds at $x_a$, where $\frac{\mathbf{I}_m(x_a) - \mathbf{I}_f(x_a)}{\mathbf{u}(x_a)} \approx \frac{d\mathbf{I}_f}{dx}(x_a)$. Here, $\mathbf{I}_m(x)$ is generated by translating $\mathbf{I}_f(x)$ horizontally by $-1$ and adding a global bias of $-0.2$ for $x > 1$. However, the displacement cannot be determined at $x_b$ (where $\mathbf{I}_m(x_b) - \mathbf{I}_f(x_b) \approx 0$), between $x=2$ and $x_d$ (where a global bias is applied), or between $x_d$ and $x_e$ (where $\|\nabla \mathbf{I}_f\| = \|\nabla \mathbf{I}_m\| = 0$). Additional constraints are required to propagate displacements from surrounding regions.
  • Figure 4: Visual comparison of the trade-off between average Dice and computational complexity for varying numbers of conv layers in the EOIR encoder ($n_c$ from 0 to 6), alongside the top-performing pyramid methods RDP [wang_RDP] and MemWarp [zhang2024memwarp] on the abdomen dataset. Circle size and labels indicate network parameter size, and multi-adds (G) are plotted on a logarithmic x-axis (see the appendix for further metric details). This comparison highlights the effects of our M5.
  • Figure 5: Visual illustration of the components of the encoder and flow estimator in EOIR. To illustrate the encoder structure, we use a three-level feature pyramid, which consists of three Conv-Norm-Act blocks and two trilinear downsampling layers, producing three pairs of moving and fixed images at different scales. Each pyramid level's flow estimator shares the same structure but with different weights; it consists of a Hadamard transformation, three Conv-Norm-Act blocks, and a single convolution to produce a residual displacement field at that level. In our experiments, we empirically set $K_s=1$.
  • ...and 10 more figures
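
The coarse-to-fine composition described in the Figure 2 caption can be sketched in one dimension as follows. This is an illustrative sketch under stated assumptions, not the EOIR implementation: it assumes each level predicts a residual displacement on its own grid, that a coarser field is upsampled by a factor of 2 and rescaled to finer-grid units, and that composition follows the convention $\phi = \phi_{\text{prev}} \circ \phi_{\text{res}}$ with $\phi(x) = x + u(x)$. The helper names and the constant residual fields are hypothetical.

```python
def sample_linear(values, x):
    """Linearly interpolate a 1D list at real-valued position x (clamped to the grid)."""
    x = min(max(x, 0.0), len(values) - 1.0)
    i0 = int(x)
    i1 = min(i0 + 1, len(values) - 1)
    t = x - i0
    return (1.0 - t) * values[i0] + t * values[i1]

def upsample_displacement(disp, factor=2):
    """Resample a displacement field onto a finer grid and rescale it.

    A shift of d samples on the coarse grid corresponds to factor * d
    samples on the fine grid, hence the multiplication.
    """
    n_fine = len(disp) * factor
    return [factor * sample_linear(disp, i / factor) for i in range(n_fine)]

def compose(disp_prev, disp_res):
    """Displacement of x -> phi_prev(phi_res(x)), where phi(x) = x + disp(x)."""
    return [
        disp_res[i] + sample_linear(disp_prev, i + disp_res[i])
        for i in range(len(disp_res))
    ]

def coarse_to_fine(residuals):
    """Compose per-level residual displacements, ordered coarsest to finest."""
    total = residuals[0]
    for res in residuals[1:]:
        total = compose(upsample_displacement(total), res)
    return total

# Hypothetical three-level pyramid with constant residuals at each level.
residuals = [
    [1.5] * 8,    # coarsest level: 1.5 coarse samples = 6 finest-grid samples
    [0.5] * 16,   # middle level:   0.5 samples       = 1 finest-grid sample
    [0.25] * 32,  # finest level
]
total = coarse_to_fine(residuals)
print(total[0], len(total))
```

With these constant residuals the composed displacement is 6 + 1 + 0.25 = 7.25 finest-grid samples, even though no single level predicts a shift larger than 1.5 samples on its own grid. That is the sense in which a large deformation is split into small residual steps; the actual interpolation scheme and composition order used by EOIR may differ.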