MOSMOS: Multi-organ segmentation facilitated by medical report supervision

Weiwei Tian, Xinyu Huang, Junlin Hou, Caiyue Ren, Longquan Jiang, Rui-Wei Zhao, Gang Jin, Yuejie Zhang, Daoying Geng

TL;DR

A novel pre-training & fine-tuning framework, Multi-Organ Segmentation by harnessing Medical repOrt Supervision (MOSMOS), is proposed. It uses global contrastive learning to align medical image-report pairs in the pre-training stage and multi-label recognition to implicitly learn the semantic correspondence between image pixels and organ tags.

Abstract

Owing to the large amount of multi-modal data in modern medical systems, such as medical images and reports, Medical Vision-Language Pre-training (Med-VLP) has achieved strong results in coarse-grained downstream tasks (i.e., medical classification, retrieval, and visual question answering). However, transferring the knowledge learned by Med-VLP to fine-grained multi-organ segmentation tasks has barely been investigated. Multi-organ segmentation is challenging mainly because large-scale fully annotated datasets are lacking and because the shape and size of the same organ vary widely between individuals with different diseases. In this paper, we propose a novel pre-training & fine-tuning framework for Multi-Organ Segmentation by harnessing Medical repOrt Supervision (MOSMOS). Specifically, we first introduce global contrastive learning to maximally align the medical image-report pairs in the pre-training stage. To remedy the granularity discrepancy, we further leverage multi-label recognition to implicitly learn the semantic correspondence between image pixels and organ tags. More importantly, our pre-trained models can be transferred to any segmentation model by introducing the pixel-tag attention maps. Two network settings, 2D U-Net and 3D UNETR, are used to validate the generalization. We have extensively evaluated our approach across different diseases and modalities on the BTCV, AMOS, MMWHS, and BRATS datasets. Experimental results in various settings demonstrate the effectiveness of our framework. This framework can serve as a foundation for future research on automatic annotation tasks under the supervision of medical reports.
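
The global image-report alignment described in the abstract can be illustrated with a small sketch. The abstract does not give the exact loss, so the code below assumes a common CLIP-style symmetric InfoNCE objective; the function names, the temperature value, and the use of plain NumPy arrays in place of encoder outputs are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each feature vector to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def global_contrastive_loss(image_feats, report_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired global features.

    image_feats, report_feats: arrays of shape (B, C); row i of each
    is assumed to come from the same image-report pair.
    The temperature value is an assumption, not taken from the paper.
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(report_feats)
    logits = img @ txt.T / temperature            # (B, B) similarity matrix
    idx = np.arange(logits.shape[0])              # matched pairs lie on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> report
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # report -> image
    return 0.5 * (loss_i2t + loss_t2i)
```

Under these assumptions, the loss is small when each image feature is most similar to its own report feature and grows when the pairing is scrambled, which is the behavior a global alignment objective is meant to encourage.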

Paper Structure (33 sections, 19 equations, 7 figures, 6 tables)


Figures (7)

  • Figure 1: An example of our multi-organ segmentation result when using medical report supervision for pre-training. Left: The attention map locating the organ tags in the radiology image extracted from the corresponding medical report. Right: Our segmentation result for the corresponding organ tags.
  • Figure 2: Illustration of the 20 organ categories in the tag list. The tag size is proportional to the tag frequency in the training set of the ROCO dataset.
  • Figure 3: Illustration of our proposed MOSMOS framework in both pre-training and fine-tuning stages. (a) In the pre-training stage, MOSMOS applies image-report contrastive learning to align the global features of radiology images with those of corresponding medical reports. To further learn fine-grained visual representation from medical report supervision, the visual spatial features and the embeddings of the constructed $K$-class tags are sent to the Transformer decoder for multi-label recognition. Note that the ground-truth tags are extracted from the medical reports with no manual annotation. Note: $B_{1}$: batch size, $H_{1}$: height of the image, $W_{1}$: width of the image, $C_{1}$: dimension of the image, $\hat{H}_{1}$: height of the image embedding, $\hat{W}_{1}$: width of the image embedding, $C$: dimension of the image embedding, text embedding, and tag embedding, $K$: number of the organ tags, $N$: token length of the medical report, $N_{1}$: token length of the learnable textual context, $N_{2}$: token length of the tag, $C_{2}$: dimension of the medical report and tag. (b) In the fine-tuning stage, the pixel-tag attention maps calculated by the Transformer decoder are fed into the image decoder. The segmentation loss and the pixel-tag aligning loss are combined to supervise the training process. Note that the learnable textual context is shared across all tags and is continuously updated in both stages. Note: $B_{2}$: batch size, $H_{2}$: height of the image, $W_{2}$: width of the image, $D_{2}$: depth of the image, $\hat{H}_{2}$: height of the image embedding, $\hat{W}_{2}$: width of the image embedding, $\hat{D}_{2}$: depth of the image embedding, $Q$: number of the organ tags. For 2D images, $D_{2}$ and $\hat{D}_{2}$ are omitted.
  • Figure 4: Dice score gap between UNETR (blue) and MOSMOS (green) on the AMOS validation sets for CT (a) and MRI (b). Notably, Duo, Bla, and PoU belong to open-set organ categories. Note: Spl: spleen, RKid: right kidney, LKid: left kidney, Gal: gallbladder, Eso: esophagus, Liv: liver, Sto: stomach, Aor: aorta, IVC: inferior vena cava, Pan: pancreas, RAG: right adrenal gland, LAG: left adrenal gland, Duo: duodenum, Bla: bladder, PoU: prostate or uterus.
  • Figure 5: Dice box plots of our approach based on ResNet-50 (a) and ViT-B/16 (b) visual backbones on BTCV, when varying the weight $\lambda$ of the pixel-tag aligning loss from 0.1 to 1.0.
  • ...and 2 more figures
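
The Figure 3 caption describes tag embeddings and visual spatial features passing through a Transformer decoder, which yields pixel-tag attention maps and multi-label tag predictions. A heavily simplified sketch of that idea follows. It assumes a single cross-attention step with no learned projections, a dot-product readout for the tag logits, and a binary cross-entropy loss; none of these details are confirmed by the text here, and all function names are hypothetical.

```python
import numpy as np

def pixel_tag_attention(pixel_feats, tag_embeds):
    """One simplified cross-attention step with tags as queries.

    pixel_feats: (H, W, C) visual spatial features.
    tag_embeds:  (K, C) embeddings of the K organ tags.
    Returns attention maps of shape (K, H, W) and tag logits of shape (K,).
    """
    h, w, c = pixel_feats.shape
    keys = pixel_feats.reshape(h * w, c)
    scores = tag_embeds @ keys.T / np.sqrt(c)       # (K, H*W) scaled dot products
    scores -= scores.max(axis=1, keepdims=True)     # stabilize the softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)         # each tag attends over all pixels
    attended = attn @ keys                          # (K, C) tag-specific visual summary
    tag_logits = (attended * tag_embeds).sum(axis=1)  # assumed readout, one logit per tag
    return attn.reshape(-1, h, w), tag_logits

def multilabel_bce(tag_logits, tag_targets):
    # Stable binary cross-entropy with logits, averaged over the K tags.
    x, y = tag_logits, tag_targets
    return np.mean(np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x))))
```

In this sketch, each of the K attention maps is a distribution over pixel positions for one tag. Maps of this kind are what the caption describes being fed into the image decoder during fine-tuning, while the tag targets would come from tags extracted from the reports.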