Spatially Covariant Image Registration with Text Prompts

Xiang Chen, Min Liu, Rongguang Wang, Renjiu Hu, Dongdong Liu, Gaolei Li, Hang Zhang

TL;DR

This work addresses the efficiency and accuracy gap in deformable medical image registration by introducing textSCF, a framework that combines spatially covariant filters with text-driven anatomical prompts encoded via CLIP. By mapping anatomical-region prompts to per-voxel filter weights through a three-branch architecture (text, mask, and feature branches), the method produces region-aware deformation fields that preserve discontinuities between organs while remaining computationally efficient. Empirical results on brain MRI (OASIS) and abdominal CT demonstrate state-of-the-art Dice scores and favorable smoothness (SDlogJ), with notable gains when incorporating external segmentation and semantic text embeddings; the approach also shows transferability across regions and architectures and scales down parameters with minimal accuracy loss. The work highlights the practical impact of combining visual-language priors with spatially covariant priors to improve registration in resource-constrained clinical settings, offering a pathway toward more robust, interpretable deformable registration in multi-organ contexts.

Abstract

Medical images are often characterized by their structured anatomical representations and spatially inhomogeneous contrasts. Leveraging anatomical priors in neural networks can greatly enhance their utility in resource-constrained clinical settings. Prior research has harnessed such information for image segmentation, yet progress in deformable image registration has been modest. Our work introduces textSCF, a novel method that integrates spatially covariant filters and textual anatomical prompts encoded by visual-language models, to fill this gap. This approach optimizes an implicit function that correlates text embeddings of anatomical regions to filter weights, relaxing the typical translation-invariance constraint of convolutional operations. TextSCF not only boosts computational efficiency but can also retain or improve registration accuracy. By capturing the contextual interplay between anatomical regions, it offers impressive inter-regional transferability and the ability to preserve structural discontinuities during registration. TextSCF's performance has been rigorously tested on inter-subject brain MRI and abdominal CT registration tasks, outperforming existing state-of-the-art models in the MICCAI Learn2Reg 2021 challenge and leading the leaderboard. In abdominal registrations, textSCF's larger model variant improved the Dice score by 11.3% over the second-best model, while its smaller variant maintained similar accuracy but with an 89.13% reduction in network parameters and a 98.34% decrease in computational operations.
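To make the idea of an implicit function mapping region text embeddings to filter weights more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The layer widths ($C_1 \to C_{\Phi} \to 2C_{\Phi} \to C_2$) follow the Figure 2 caption; everything else is an assumption made for illustration: the ReLU activation, the random stand-ins for text embeddings and backbone features, the function names, and the choice to stop at a single filtered response map rather than a full three-component displacement field.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(c1, c_phi, c2):
    """Hypothetical three-layer MLP with widths C1 -> C_phi -> 2*C_phi -> C2."""
    dims = [c1, c_phi, 2 * c_phi, c2]
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    """Apply the MLP; ReLU between layers is an assumption, not from the source."""
    for k, (w, b) in enumerate(params):
        x = x @ w + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def spatially_covariant_response(text_emb, seg_probs, features, params):
    """
    text_emb:  (K, C1)       one embedding per anatomical region
    seg_probs: (K, D, H, W)  per-voxel region probabilities (or a one-hot mask)
    features:  (C2, D, H, W) backbone features
    returns:   (D, H, W)     response of a filter whose weights vary per voxel
    """
    region_filters = mlp_forward(params, text_emb)                          # (K, C2)
    # Each voxel receives the filter of the region(s) it belongs to.
    voxel_filters = np.einsum("kc,kdhw->cdhw", region_filters, seg_probs)  # (C2, D, H, W)
    # Position-dependent weights: this is what relaxes translation invariance.
    return np.einsum("cdhw,cdhw->dhw", voxel_filters, features)

# Toy example: two regions split along the first spatial axis.
K, C1, C_PHI, C2, D, H, W = 2, 8, 4, 3, 2, 2, 2
params = make_mlp(C1, C_PHI, C2)
text_emb = rng.standard_normal((K, C1))   # stand-in for text-encoder output
seg = np.zeros((K, D, H, W))
seg[0, 0] = 1.0                           # region 0 occupies slice d = 0
seg[1, 1] = 1.0                           # region 1 occupies slice d = 1
features = np.ones((C2, D, H, W))         # identical features at every voxel
response = spatially_covariant_response(text_emb, seg, features, params)
```

Because the features are identical everywhere in this toy example, any difference in `response` between the two slices comes only from the region-specific filter weights, which is the sense in which the filter is spatially covariant rather than translation invariant.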

Paper Structure (42 sections, 6 equations, 8 figures, 6 tables)

This paper contains 42 sections, 6 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Overall framework of the proposed textSCF. The process is detailed in Section \ref{sec:framework}. Orange arrows indicate the loss functions (detailed in Section \ref{sec:loss_function}); the Dice loss is omitted from the figure for brevity. A lock symbol on a module means that its weights are frozen and not subject to training.
  • Figure 2: The figure demonstrates the structure of the encoder-decoder backbone network $f_{\xi}$ (left panel) and the architecture of the implicit function $\Phi_{\theta}$ (right panel). Feature tensors are represented by rectangles, each produced from its predecessor through a 3D convolutional layer. Arrows signify skip connections, merging tensors of the encoder and decoder, where $N_s$ indicates the channel count. The $\Phi_{\theta}$ function, realized by an MLP with three layers, maps text embeddings from dimension $C_1$ to successive intermediate dimensions $C_{\Phi}$ and $2C_{\Phi}$, culminating in the final dimension $C_2$.
  • Figure 3: Boxplots depicting the Dice scores for each anatomical structure on the Abdomen CT dataset. Included structures are the spleen, right kidney, left kidney, gall bladder, esophagus, liver, stomach, aorta, inferior vena cava, portal and splenic vein, pancreas, left adrenal gland, and right adrenal gland. Structures are arranged in order of their average Dice score achieved with textSCF.
  • Figure 4: Trade-off between smoothness and Dice (%). This plot shows the relationship between average Dice scores and the smoothness metric SDlogJ in brain and abdomen registrations. In both registrations, network variants differ in whether a diffeomorphic integration layer \cite{dalca2018unsupervised} is applied (indicated by "Int" next to points) and in the global smoothness term coefficient $\lambda$.
  • Figure 5: (a) Relationship between the channel count $C_{\Phi}$ and the Dice score achieved by textSCF on the Abdomen dataset. Each data point is the Dice score attained at a given $C_{\Phi}$, which is plotted on a logarithmic scale to highlight the incremental gains. (b) Trade-off between average Dice (%) and computational complexity on the Abdomen dataset. The plot compares network parameter size and multiply-add operations (in G), with the x-axis on a logarithmic scale. Starting channel counts $N_s$ for textSCF, LapIRN, and LKU-Net increase from 8 to 32, corresponding to left-to-right movement on the graph. Circle size indicates the parameter size of each network.
  • ...and 3 more figures
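
The two metrics that recur in the captions above, Dice and SDlogJ, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the evaluation code used in the paper or the Learn2Reg challenge: the function names are hypothetical, the displacement field is assumed to be in voxel units with shape (3, D, H, W), derivatives use `np.gradient` finite differences, and non-positive Jacobian determinants are clipped to a small `eps` before the logarithm.

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks: 2|A and B| / (|A| + |B|)."""
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention chosen here for two empty masks
    return 2.0 * np.logical_and(a, b).sum() / denom

def sd_log_jacobian(disp, eps=1e-9):
    """
    Standard deviation of log|J| for the map phi(x) = x + disp(x).
    disp: (3, D, H, W) displacement field in voxel units (assumed layout).
    """
    # grads[i][j] approximates d disp_i / d x_j with finite differences.
    grads = [np.gradient(disp[i], axis=(0, 1, 2)) for i in range(3)]
    jac = np.stack([np.stack(g, axis=0) for g in grads], axis=0)  # (3, 3, D, H, W)
    jac = jac + np.eye(3)[:, :, None, None, None]                 # add identity for phi
    det = np.linalg.det(np.moveaxis(jac, (0, 1), (-2, -1)))       # (D, H, W)
    # Clipping folded voxels (det <= 0) is an assumption of this sketch.
    return float(np.std(np.log(np.clip(det, eps, None))))

# Toy checks.
mask_a = np.array([1, 1, 0, 0])
mask_b = np.array([0, 1, 1, 0])
overlap = dice(mask_a, mask_b)            # one shared voxel out of 2 + 2

identity = np.zeros((3, 4, 4, 4))         # zero displacement
grid = np.stack(np.meshgrid(np.arange(4.0), np.arange(4.0), np.arange(4.0),
                            indexing="ij"), axis=0)
uniform_scale = 0.1 * grid                # phi(x) = 1.1 * x everywhere
```

A uniform scaling has the same Jacobian determinant at every voxel, so its SDlogJ is zero even though the transform is not the identity; the metric measures how unevenly the deformation expands or compresses space, which is why it is read as a smoothness measure.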