Cross-Modal Conditioned Reconstruction for Language-guided Medical Image Segmentation

Xiaoshuang Huang, Hongxiang Li, Meng Cao, Long Chen, Chenyu You, Dong An

TL;DR

RecLMIS tackles misalignment in language-guided medical image segmentation by explicitly modeling cross-modal interactions through conditioned reconstruction. The method introduces a Conditioned Interaction module and two reconstruction branches, Conditioned Vision Reconstruction (CVR) and Conditioned Language Reconstruction (CLR), to align visual and textual features, enabling fine-grained cross-modal understanding and efficient inference. It uses a conditioned contrastive loss to further tighten cross-modal representations and achieves state-of-the-art performance on QaTa-COV19 and MosMedData+ with substantial reductions in parameters and FLOPs. The work demonstrates practical impact for reliable language-guided medical image segmentation with faster inference, and code will be released.

Abstract

Recent developments underscore the potential of textual information in enhancing learning models for a deeper understanding of medical visual semantics. However, language-guided medical image segmentation still faces a challenging issue. Previous works employ implicit and ambiguous architectures to embed textual information. This leads to segmentation results that are inconsistent with the semantics represented by the language, sometimes even diverging significantly. To this end, we propose a novel cross-modal conditioned Reconstruction for Language-guided Medical Image Segmentation (RecLMIS) to explicitly capture cross-modal interactions, which assumes that well-aligned medical visual features and medical notes can effectively reconstruct each other. We introduce conditioned interaction to adaptively predict patches and words of interest. Subsequently, they are utilized as conditioning factors for mutual reconstruction to align with regions described in the medical notes. Extensive experiments demonstrate the superiority of our RecLMIS, surpassing LViT by 3.74% mIoU on the publicly available MosMedData+ dataset and achieving an average increase of 1.89% mIoU for cross-domain tests on our QaTa-COV19 dataset. Simultaneously, we achieve a relative reduction of 20.2% in parameter count and a 55.5% decrease in computational load. The code will be available at https://github.com/ShashankHuang/RecLMIS.
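
The abstract reports results as mIoU gains and as relative reductions in parameters and compute. As a rough illustration of how such figures are typically derived, the sketch below computes a mean IoU over binary masks and a relative reduction. It assumes mIoU is the average of per-image IoU scores, which may differ from the paper's exact evaluation protocol. The function names, the toy masks, and the numbers passed to `relative_reduction` are hypothetical placeholders, not values from the paper.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds, targets):
    """Average the per-image IoU scores (assumed definition of mIoU)."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


def relative_reduction(baseline, ours):
    """Fraction by which `ours` is smaller than `baseline`."""
    return (baseline - ours) / baseline


# Toy 2x2 masks, purely illustrative.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]

miou = mean_iou(preds, targets)               # (0.5 + 1.0) / 2
reduction = relative_reduction(100.0, 79.8)   # placeholder sizes, not real counts
```

A model with 79.8 units of some cost against a baseline of 100 units gives a relative reduction of about 20.2%; the actual parameter counts behind the reported percentage are not stated in this section.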

Paper Structure (32 sections, 13 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: (a) Existing methods (e.g., LViT [li2023lvit]) do not fully and effectively adhere to the correct text prompts. For example, the text 'all left' is not fully reflected in the attention maps of LViT [li2023lvit]. In contrast, the proposed RecLMIS focuses on the regions that match the text prompts. (b) Comparison with state-of-the-art methods on the MosMedData+ dataset [morozov2020mosmeddata, hofmanninger2020automatic, li2023lvit] in terms of mIoU (y-axis), parameter count (area of each marker), and FLOPs (x-axis). The proposed RecLMIS is superior in both performance and FLOPs.
  • Figure 2: A comparative analysis of different model architectures for mask modeling and Language-guided Medical Image Segmentation (LMIS). The orange patch represents the masked region. 'V-Encoder', 'T-Encoder', 'V-Decoder', and 'T-Decoder' denote the vision encoder, text encoder, vision decoder, and text decoder, respectively. 'Attn' and 'MSE' stand for attention and Mean Squared Error, respectively. (a) The MAE and BERT series of mask modeling architectures mask the original image/sentence and then use the V-Decoder/T-Decoder for reconstruction. To achieve referring image segmentation, fine-tuning on downstream task datasets is necessary. This approach encompasses methods such as [chen2022multi, wang2023swinmm, devlin2018bert]. (b) The parallel U-shape architecture (LViT [li2023lvit], LAVT [yang2022lavt]) adds a parallel U-shape structure to the original segmentation model for processing and fusing text, aiming to align visual and textual features. (c) The dual-branch fusion architecture [liu2023multi, hu2023beyond] employs N layers of cross-attention in the text branch to align visual and textual features, minimizing the MSE between post-interaction and pre-interaction textual features. (d) Our cross-modal conditional reconstruction fusion architecture uses textual/visual features and conditions to reconstruct the current visual/textual features during training, enabling the alignment of visual and textual features with just a single layer of cross-attention.
  • Figure 3: Overview of the proposed cross-modal conditioned Reconstruction for Language-guided Medical Image Segmentation (RecLMIS). Given a pair of medical images and notes as prompts, we first exploit the visual encoder and text encoder to extract image and text features, respectively. The Conditioned Interaction module (Sec. \ref{sec: ci}) is designed to align features from both vision and language inputs. The conditioned vision-language reconstruction encompasses a Conditioned Vision Reconstruction module (CVR, Sec. \ref{sec:CVR}) and a Conditioned Language Reconstruction module (CLR, Sec. \ref{sec:CLR}), serving the purposes of medical image reconstruction and language feature reconstruction, respectively. Finally, a lightweight mask predictor is employed to segment the designated region indicated by the text prompts.
  • Figure 4: The structure of the Conditioned Interaction module.
  • Figure 5: The structure of (a) Conditioned Vision Reconstruction module and (b) Conditioned Language Reconstruction.
  • ...and 9 more figures
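
The Figure 2(d) caption describes reconstructing one modality's features from the other through a single cross-attention layer during training. The sketch below is a minimal, dependency-free illustration of that general idea, not the authors' CVR or CLR module: it has no learned projections, it does not model the conditioning factors, and the choice to replace masked patch queries with zero vectors is an assumption made here for simplicity. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    count = sum(len(row) for row in a)
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / count


def visual_reconstruction_loss(visual, text, mask):
    """Rebuild masked visual patches from text tokens; score masked patches only."""
    dim = len(visual[0])
    # Assumption: a masked patch contributes an all-zero query vector.
    queries = [[0.0] * dim if m else v for v, m in zip(visual, mask)]
    recon = cross_attention(queries, text, text)
    masked_recon = [r for r, m in zip(recon, mask) if m]
    masked_target = [v for v, m in zip(visual, mask) if m]
    return mse(masked_recon, masked_target)


# Toy features: two visual patches, two text tokens, second patch masked.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
loss = visual_reconstruction_loss(visual, text, [False, True])
```

In this toy setup the zero query attends uniformly to the text tokens, so the masked patch is reconstructed as their mean. A trainable version would add learned query, key, and value projections, and the language branch would mirror the same pattern with the roles of the two modalities swapped.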