ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration

Junyu Chen, Yufan He, Eric C. Frey, Ye Li, Yong Du

TL;DR

The paper tackles the limitation of conventional ConvNets in capturing long-range spatial relations for volumetric medical image registration. It introduces ViT-V-Net, a hybrid architecture that combines ConvNet feature encoding with a Vision Transformer operating on patch embeddings to model global context, followed by a V-Net–style decoder that produces a dense displacement field; long skip connections between encoder and decoder preserve localization details. Training is unsupervised, with a loss that combines mean squared error between the images with a diffusion regularizer that encourages smooth deformations. Empirical results on brain MRI show higher Dice scores than the compared methods (SyN, NiftyReg, VoxelMorph), supporting the use of ViT-based global context modeling for 3D deformable image registration (DIR).
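The unsupervised loss described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the `(3, D, H, W)` layout of the displacement field, the use of forward differences, and the default weight `lam` are all assumptions made for the example, and the warped moving image is taken as an input instead of being computed by a spatial transformer.

```python
import numpy as np


def mse_loss(warped, fixed):
    """Mean squared intensity difference between warped moving and fixed volumes."""
    return float(np.mean((warped - fixed) ** 2))


def diffusion_regularizer(disp):
    """Penalize spatial gradients of a displacement field.

    disp has shape (3, D, H, W): one displacement component per spatial axis
    (layout assumed for this sketch). Gradients are approximated with forward
    differences, and the squared differences are averaged over the three axes.
    """
    terms = []
    for axis in (1, 2, 3):
        d = np.diff(disp, axis=axis)  # forward difference along one spatial axis
        terms.append(np.mean(d ** 2))
    return float(sum(terms) / 3.0)


def registration_loss(warped, fixed, disp, lam=0.02):
    """Similarity term plus weighted smoothness term.

    lam is a placeholder weight chosen for illustration only.
    """
    return mse_loss(warped, fixed) + lam * diffusion_regularizer(disp)
```

A constant displacement field (a pure translation) has zero spatial gradient, so it incurs no smoothness penalty; only spatially varying deformations are penalized.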

Abstract

In the last decade, convolutional neural networks (ConvNets) have dominated and achieved state-of-the-art performance in a variety of medical imaging applications. However, the performance of ConvNets is still limited by a lack of understanding of long-range spatial relations in an image. The recently proposed Vision Transformer (ViT) for image classification uses a purely self-attention-based model that learns long-range spatial relations to focus on the relevant parts of an image. Nevertheless, ViT emphasizes low-resolution features because of the consecutive downsamplings, resulting in a lack of detailed localization information and making it unsuitable for image registration. Recently, several ViT-based image segmentation methods have been combined with ConvNets to improve the recovery of detailed localization information. Inspired by them, we present ViT-V-Net, which bridges ViT and ConvNet to provide volumetric medical image registration. The experimental results presented here demonstrate that the proposed architecture achieves superior performance to several top-performing registration methods.

Paper Structure

This paper contains 7 sections, 5 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Method overview and network architecture of ViT-V-Net.
  • Figure 2: Registration results of an MR coronal slice. Additional results are shown in the appendix.
  • Figure 3: Model overview of the Vision Transformer.
  • Figure 4: Training loss value and validation Dice score per epoch. The proposed ViT-V-Net exhibits lower loss values and higher Dice scores during training.
  • Figure 5: Boxplots of Dice scores for various anatomical structures obtained using different registration methods. Dice scores of the left and right brain hemispheres were averaged into a single score. Orange triangles denote the means.
  • ...and 1 more figure
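The Dice scores reported in Figures 4 and 5 measure the overlap between two label maps, and Figure 5 averages the left and right hemisphere scores of a structure into one value. A minimal sketch of that computation follows, assuming integer label maps; the function names and the label values in the usage note are hypothetical and not taken from the paper.

```python
import numpy as np


def dice_score(seg_a, seg_b, label):
    """Dice overlap of one label between two integer label maps."""
    a = seg_a == label
    b = seg_b == label
    denom = int(a.sum()) + int(b.sum())
    if denom == 0:
        return float("nan")  # label absent from both maps, overlap undefined
    return 2.0 * int(np.logical_and(a, b).sum()) / denom


def bilateral_dice(seg_a, seg_b, left_label, right_label):
    """Average the left and right hemisphere scores of one structure."""
    left = dice_score(seg_a, seg_b, left_label)
    right = dice_score(seg_a, seg_b, right_label)
    return 0.5 * (left + right)
```

For example, with hypothetical labels 1 and 2 standing for the left and right parts of a structure, `bilateral_dice(warped_seg, fixed_seg, 1, 2)` would give the single per-structure score of the kind plotted in the boxplots.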