Vector Field Attention for Deformable Image Registration

Yihao Liu, Junyu Chen, Lianrui Zuo, Aaron Carass, Jerry L. Prince

TL;DR

This work introduces Vector Field Attention (VFA), a novel deformable image registration framework that directly retrieves pixel-level correspondences via a fixed yet differentiable attention mechanism. By separating feature extraction, local feature matching, and a fixed vector-field-based location retrieval, VFA achieves accurate multi-resolution registrations across intra- and inter-modality tasks and under unsupervised, weakly supervised, and semi-supervised settings. Across four public benchmarks, VFA consistently outperforms state-of-the-art methods while maintaining fast inference, illustrating the value of decoupling feature matching from deformation prediction. The approach offers a scalable, memory-aware path toward robust registrations and opens avenues for 4D registration and broader applications in computer vision tasks requiring precise location correspondences.
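The location-retrieval idea can be illustrated with a minimal, hypothetical sketch for a single fixed-image location in 2D. The query is the fixed feature at that location, the keys are moving-image features in a small search window, and the values are the fixed relative offsets of that window (the "vector field"). The attention-weighted sum of offsets is therefore a sub-pixel displacement, and the step itself has no learnable weights. The function name, the dot-product similarity, the (2r+1)×(2r+1) window, and the choice to skip out-of-bounds neighbours are all assumptions for illustration. The authors' implementation works on batched 3D tensors and may differ in these details.

```python
import math


def make_grid(h, w, c):
    """Hypothetical helper: an h x w grid of zero feature vectors with c channels."""
    return [[[0.0] * c for _ in range(w)] for _ in range(h)]


def vector_field_attention(fixed, moving, y, x, radius=1):
    """Sketch of parameter-free attention returning the displacement at fixed location (y, x).

    fixed, moving: 2D grids (lists of rows) of feature vectors with the same shape.
    """
    h, w = len(fixed), len(fixed[0])
    query = fixed[y][x]
    offsets, scores = [], []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:  # assumption: out-of-bounds neighbours are skipped
                key = moving[yy][xx]
                scores.append(sum(q * k for q, k in zip(query, key)))  # dot-product similarity
                offsets.append((dy, dx))  # fixed relative offset acts as the "value"
    # Softmax over similarities, shifted by the max for numerical stability.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [wt / total for wt in weights]
    # Attention-weighted sum of the fixed offset vectors gives a (sub-pixel) displacement.
    disp_y = sum(wt * oy for wt, (oy, _) in zip(weights, offsets))
    disp_x = sum(wt * ox for wt, (_, ox) in zip(weights, offsets))
    return disp_y, disp_x


# Toy check: the centre feature of the fixed map reappears one pixel to the right in the moving map.
fixed = make_grid(3, 3, 2)
moving = make_grid(3, 3, 2)
fixed[1][1] = [5.0, 0.0]
moving[1][2] = [5.0, 0.0]
dy, dx = vector_field_attention(fixed, moving, 1, 1)
print(dy, dx)  # displacement close to (0, 1)
```

Because the values are fixed offsets rather than learned projections, all learning happens in the feature extractor: the network only has to make corresponding features similar.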

Abstract

Deformable image registration establishes non-linear spatial correspondences between fixed and moving images. Deep learning-based deformable registration methods have been widely studied in recent years because they are faster than traditional algorithms and offer better accuracy. Most existing deep learning-based methods require neural networks to encode location information in their feature maps and to predict displacement or deformation fields through convolutional or fully connected layers from these high-dimensional feature maps. In this work, we present Vector Field Attention (VFA), a novel framework that enhances the efficiency of the existing network design by enabling direct retrieval of location correspondences. VFA uses neural networks to extract multi-resolution feature maps from the fixed and moving images and then retrieves pixel-level correspondences based on feature similarity. The retrieval is achieved with a novel attention module that has no learnable parameters. VFA is trained end-to-end in either a supervised or unsupervised manner. We evaluated VFA for intra- and inter-modality registration and for unsupervised and semi-supervised registration using public datasets, and we also evaluated it on the Learn2Reg challenge. Experimental results demonstrate the superior performance of VFA compared to existing methods. The source code of VFA is publicly available at https://github.com/yihao6/vfa/.
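The unsupervised training scheme mentioned above can be illustrated with a generic objective that is common in learning-based registration: an image dissimilarity term between the fixed image and the warped moving image, plus a smoothness penalty on the displacement. The sketch below is 1D for brevity and uses mean squared error, a finite-difference regularizer, and a weight `lam`; these choices and the function names are assumptions for illustration, not VFA's exact loss.

```python
import math


def warp_1d(moving, disp):
    """Warp a 1D signal: out[i] = moving(i + disp[i]) via linear interpolation, clamped at the borders."""
    n = len(moving)
    out = []
    for i, d in enumerate(disp):
        p = min(max(i + d, 0.0), n - 1.0)  # clamp the sampling position to the signal domain
        i0 = int(math.floor(p))
        i1 = min(i0 + 1, n - 1)
        t = p - i0
        out.append((1.0 - t) * moving[i0] + t * moving[i1])
    return out


def unsupervised_loss(fixed, moving, disp, lam=0.1):
    """MSE(fixed, warped moving) + lam * mean squared finite difference of the displacement."""
    warped = warp_1d(moving, disp)
    sim = sum((f - w) ** 2 for f, w in zip(fixed, warped)) / len(fixed)
    smooth = sum((disp[i + 1] - disp[i]) ** 2 for i in range(len(disp) - 1)) / (len(disp) - 1)
    return sim + lam * smooth


fixed = [0.0, 0.0, 1.0, 0.0, 0.0]
moving = [0.0, 0.0, 0.0, 1.0, 0.0]  # same bump, shifted one pixel to the right
print(unsupervised_loss(fixed, moving, [1.0] * 5))  # 0.0: a uniform shift of +1 aligns the signals
print(unsupervised_loss(fixed, moving, [0.0] * 5))  # 0.4: identity leaves 2 of 5 samples mismatched
```

A supervised or semi-supervised variant would add a term comparing the prediction against available ground truth, such as warped segmentation labels.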

Paper Structure

This paper contains 15 sections, 5 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Overview of the supervised and unsupervised training schemes for deep learning based deformable registration algorithms.
  • Figure 2: (a) An overview of VFA and (b) the detailed network architecture of the U-shaped feature extractor network. The superscripts on the feature maps and transformations indicate different spatial resolutions. $I_f$ and $I_m$ denote the fixed and moving images, respectively.
  • Figure 3: (a) A 2D illustration of the feature matching and location retrieval steps for a single location in fixed feature maps. (b) The 3D implementation of the feature matching and location retrieval steps using the specialized attention. The spatial dimensions are denoted as $H$, $W$, and $D$, respectively. The feature maps $F^i$ and $M^i$ are assumed to have $C$ channels.
  • Figure 4: Visualization of the multi-resolution transformations. We used four downsampling steps in the feature extraction; therefore, there are four intermediate low-resolution transformations. For visualization purposes, these transformations have been upsampled to match the spatial dimensions of the input images. Additionally, each transformation has been applied to the moving image to visualize its effect. We note that only displacements within the axial plane are visualized in the grid line representations. In practice, our algorithm outputs only the final transformation, $\phi^1$, and its corresponding warped image.
  • Figure 5: Visualization of the results for the T1w atlas to subject registration (top) and the T2w to T1w registration (bottom). The minimum and maximum values of the colorbar are specified in units of pixels.
  • ...and 4 more figures
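Figure 4 describes a cascade of low-resolution transformations that ends in the full-resolution $\phi^1$, but the captions do not spell out how the levels are combined. The 1D sketch below assumes a common coarse-to-fine scheme: each coarser displacement is upsampled by a factor of two (with its values doubled, since displacements are measured in pixels) and then composed with the residual displacement estimated at the finer level, $\phi = \phi_{\text{coarse}} \circ \phi_{\text{res}}$, i.e. $u(x) = u_{\text{res}}(x) + u_{\text{coarse}}(x + u_{\text{res}}(x))$. The function names, the nearest-neighbour upsampling, and the composition order are assumptions, not VFA's confirmed update rule.

```python
import math


def interp_1d(values, p):
    """Linearly interpolate a 1D array at real-valued position p, clamped to the domain."""
    n = len(values)
    p = min(max(p, 0.0), n - 1.0)
    i0 = int(math.floor(p))
    i1 = min(i0 + 1, n - 1)
    t = p - i0
    return (1.0 - t) * values[i0] + t * values[i1]


def upsample_disp_1d(disp):
    """Nearest-neighbour x2 upsampling; values are doubled because one coarse pixel spans two fine pixels."""
    return [2.0 * d for d in disp for _ in range(2)]


def compose_1d(u_coarse_up, u_res):
    """Displacement of phi_coarse o phi_res: u(x) = u_res(x) + u_coarse(x + u_res(x))."""
    return [r + interp_1d(u_coarse_up, x + r) for x, r in enumerate(u_res)]


# A coarse level predicts [0, 1] pixel; the finer level adds a residual shift at x = 1.
print(compose_1d(upsample_disp_1d([0.0, 1.0]), [0.0, 1.0, 0.0, 0.0]))  # [0.0, 3.0, 2.0, 2.0]
```

Composing rather than simply adding displacements keeps each finer residual defined relative to the already-warped moving image, which is the usual motivation for this design in coarse-to-fine registration.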