Tell2Reg: Establishing spatial correspondence between images by the same language prompts

Wen Yan, Qianye Yang, Shiqi Huang, Yipei Wang, Shonit Punwani, Mark Emberton, Vasilis Stavrinides, Yipeng Hu, Dean Barratt

TL;DR

Tell2Reg reframes image registration as region-level correspondence detection by prompting identical language descriptions to fixed and moving images, leveraging pre-trained SAM and GroundingDINO to extract ROIs without retraining. The training-free approach demonstrates competitive performance against unsupervised registration methods on inter-subject prostate mpMR data, with Dice and TRE metrics approaching those of weakly supervised approaches. It also provides qualitative evidence that language semantics can relate to spatial ROI correspondences, including invariances and differences across global versus local regions. Future work will explore prostate-specific multimodal foundation models, automated prompt engineering, and refinement strategies to enhance robustness, efficiency, and generalization.

Abstract

Spatial correspondence can be represented by pairs of segmented regions, so that image registration networks aim to segment corresponding regions rather than predict displacement fields or transformation parameters. In this work, we show that such a corresponding region pair can be predicted by applying the same language prompt to two different images, using pre-trained large multimodal models based on GroundingDINO and SAM. This enables a fully automated and training-free registration algorithm, potentially generalisable to a wide range of image registration tasks. We present experimental results on a challenging task, registering inter-subject prostate MR images, which involves highly variable intensity and morphology between patients. Because Tell2Reg is training-free, it eliminates the costly and time-consuming data curation and labelling previously required for this registration task. The approach outperforms the unsupervised learning-based registration methods tested and performs comparably to weakly-supervised methods. Additional qualitative results suggest, for the first time, a potential correlation between language semantics and spatial correspondence, including the spatial invariance of language-prompted regions and the difference in language prompts between the obtained local and global correspondences. Code is available at https://github.com/yanwenCi/Tell2Reg.git.
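To make the idea of region-level correspondence concrete, the following is a minimal sketch, not the authors' implementation. It assumes that prompting each image has already produced a dictionary mapping each prompt string to a binary ROI mask (in the paper these masks come from GroundingDINO and SAM, which are not called here). The function names `pair_rois_by_prompt` and `dice`, the prompt strings, and the toy masks are all hypothetical. ROIs are paired whenever the same prompt yields a mask in both images, and the pair is scored with the Dice coefficient, one of the metrics named in the TL;DR.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice overlap between two binary masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return float(2.0 * np.logical_and(a, b).sum() / total)

def pair_rois_by_prompt(fixed_rois, moving_rois):
    """Pair ROIs that were produced by the same prompt in both images.

    Both arguments are assumed to be dicts of {prompt: binary mask}.
    Prompts that produced a mask in only one image are dropped.
    """
    return [(p, fixed_rois[p], moving_rois[p])
            for p in fixed_rois if p in moving_rois]

# Toy data standing in for model outputs (hypothetical prompts and masks).
fixed = np.zeros((4, 4), dtype=bool)
fixed[0:2, :] = True            # 8 foreground pixels
warped_moving = np.zeros((4, 4), dtype=bool)
warped_moving[1:3, :] = True    # 8 foreground pixels, 4 shared with `fixed`

fixed_rois = {"hypothetical prompt A": fixed}
moving_rois = {"hypothetical prompt A": warped_moving,
               "hypothetical prompt B": warped_moving}

pairs = pair_rois_by_prompt(fixed_rois, moving_rois)
scores = {prompt: dice(f, m) for prompt, f, m in pairs}
print(scores)
```

In an evaluation setting, Dice would normally be computed between an ROI in the fixed image and the corresponding moving-image ROI after warping; the toy mask above is simply treated as already warped.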

Paper Structure

This paper contains 10 sections, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: A brief illustration of how the Tell2Reg framework uses text prompts to generate corresponding ROIs from fixed and moving images.
  • Figure 2: The Tell2Reg framework, built on pre-trained GroundingDINO and SAM. The ROI-to-dense transformation is an optional step for comparing with other methods.
  • Figure 3: Visualization of registration results. The bounding boxes were generated by GroundingDINO, and the coloured masks were produced by SAM. The binary masks highlight the selected corresponding regions of interest (ROIs) in the fixed and moving images, alongside the warped masks that show the optional spatial alignment results. The first two groups show correctly corresponding ROIs, and the third group shows a failure case with mismatched ROIs.
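
The captions above mention an optional ROI-to-dense transformation but do not describe how it is computed. As a simplified stand-in, and explicitly not the authors' method, the sketch below fits a global affine transform to the centroids of corresponding ROIs by least squares, then reports a target registration error (TRE) as the mean distance between mapped and reference points. All function names and the toy coordinates are hypothetical; only NumPy is assumed.

```python
import numpy as np

def roi_centroid(mask):
    """Centroid (in index coordinates) of a non-empty binary mask."""
    return np.argwhere(np.asarray(mask, dtype=bool)).mean(axis=0)

def fit_affine(moving_pts, fixed_pts):
    """Least-squares affine map taking moving points to fixed points.

    Returns a (d+1, d) parameter matrix for homogeneous row vectors.
    Needs at least d+1 points that do not all lie in a lower-dimensional subspace.
    """
    moving_pts = np.asarray(moving_pts, dtype=float)
    fixed_pts = np.asarray(fixed_pts, dtype=float)
    homog = np.hstack([moving_pts, np.ones((len(moving_pts), 1))])
    params, *_ = np.linalg.lstsq(homog, fixed_pts, rcond=None)
    return params

def apply_affine(params, pts):
    """Apply the fitted affine map to an (N, d) array of points."""
    pts = np.asarray(pts, dtype=float)
    return np.hstack([pts, np.ones((len(pts), 1))]) @ params

def tre(mapped_pts, reference_pts):
    """Mean Euclidean distance between mapped and reference points."""
    diff = np.asarray(mapped_pts) - np.asarray(reference_pts)
    return float(np.linalg.norm(diff, axis=1).mean())

# Toy ROI centroids: the fixed points are a scaled and shifted copy of the moving ones.
moving_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
fixed_pts = 2.0 * moving_pts + np.array([3.0, -1.0])

params = fit_affine(moving_pts, fixed_pts)
error = tre(apply_affine(params, moving_pts), fixed_pts)
print(error)
```

An affine fit cannot capture the local deformation expected between subjects, so a dense, non-rigid model would be needed in practice; the sketch only shows how a set of ROI pairs can constrain a spatial transformation.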