Language Augmentation in CLIP for Improved Anatomy Detection on Multi-modal Medical Images

Mansi Kakkar, Dattesh Shanbhag, Chandan Aladahalli, Gurunath Reddy M

TL;DR

This work addresses the problem of automated, whole-body, multi-modal anatomy labeling in MR and CT radiology images by fine-tuning PubMedCLIP on a curated multi-modal dataset of organs and body stations. It introduces image and language augmentations, including diverse prompts that encode modality, orientation, station, and organ, and uses a balanced loss to jointly optimize vision- and text-based predictions. The proposed PMC-MSA model, which combines enhanced data, text prompt diversity, and joint augmentations, achieves a 47.6% average improvement in organ detection and a 27% improvement in station detection over the PubMedCLIP baseline, demonstrating improved cross-modal anatomical understanding in clinical imaging. The approach reduces misalignment between organ and station labels and lays groundwork for robust zero-shot multi-modal anatomy classification in radiology, with future work aimed at addressing class imbalance through learned text representations.
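The "balanced loss" mentioned above is not spelled out in this summary. A minimal sketch, assuming it resembles the standard symmetric CLIP contrastive objective (an image-to-text and a text-to-image cross-entropy, weighted equally), might look like the following. The function names and the `image_weight` parameter are hypothetical and not taken from the paper.

```python
import math


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        # Log-sum-exp with max subtraction for numerical stability.
        row_max = max(row)
        log_sum = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def balanced_contrastive_loss(logits, image_weight=0.5):
    """Weighted sum of image-to-text and text-to-image losses.

    logits[i][j] is the similarity between image i and text j; matching
    pairs sit on the diagonal. image_weight=0.5 is an assumption that
    gives both directions equal weight.
    """
    loss_image = cross_entropy_rows(logits)            # image -> text
    loss_text = cross_entropy_rows(transpose(logits))  # text -> image
    return image_weight * loss_image + (1.0 - image_weight) * loss_text
```

With uninformative similarities (all logits equal) over a batch of two pairs, each direction reduces to a uniform guess, so the loss equals log 2; a strongly diagonal similarity matrix drives the loss toward zero.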

Abstract

Vision-language models have emerged as a powerful tool for previously challenging multi-modal classification problems in the medical domain. This development has led to the exploration of automated image description generation for multi-modal clinical scans, particularly for radiology report generation. Existing research has focused on clinical descriptions for specific modalities or body regions, leaving a gap for a model that provides whole-body, multi-modal descriptions. In this paper, we address this gap by automating the generation of standardized body station(s) and a list of organ(s) across the whole body in multi-modal MR and CT radiological images. Leveraging the versatility of Contrastive Language-Image Pre-training (CLIP), we refine and augment the existing approach through multiple experiments, including baseline model fine-tuning, adding station(s) as a superset for better correlation between organs, and applying image and language augmentations. Our proposed approach demonstrates a 47.6% performance improvement over the baseline PubMedCLIP.

Paper Structure

This paper contains 13 sections, 4 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Pipeline of our approach for anatomy classification. (a) Dataset creation: a multi-modal anatomy dataset is built with label pools that supply detailed captions for the images. (b) Pre-processing: image augmentation and language augmentation, in which a set of labels from the label pool passes through a manual prompt phrasing system that produces $10$ different sets of prompts for each image. (c) Baseline model: the PubMedCLIP model, with ViT-B/32 as the vision encoder and the text tokenizer as the text encoder, is fine-tuned on these images to yield our proposed model for anatomy detection.
  • Figure 2: AUC-ROC curves for the test dataset (Visible Human Project) across different models: (a) PMC (baseline), (b) PMC-M (fine-tuned on the multi-modal anatomy dataset), and (c) PMC-MSA (model with text and image data augmentations). The AUC values shown are for the proposed approach. AUC values are good for all organs except the humerus (AUC = $0.62$), probably because the data are imbalanced toward other limbs relative to the humerus.
  • Figure 3: Example predictions from different models: (a) PMC (baseline), (b) PMC-MS, and (c) PMC-MSA (proposed approach). The examples show our approach outperforming the baseline.
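
The language augmentation step in Figure 1(b) can be illustrated with a small sketch. The paper's actual prompt phrasings are not reproduced in this summary, so the ten templates below, along with the names `TEMPLATES`, `join_labels`, and `build_prompts`, are hypothetical. The sketch only shows the general idea of turning one label set (modality, orientation, station, organs) into ten differently phrased prompts.

```python
# Hypothetical templates: the paper's real phrasings are not given here.
# Each template encodes modality, orientation, station, and organ(s).
TEMPLATES = [
    "{modality} scan, {orientation} view, {station}: {organs}",
    "a {orientation} {modality} image of the {station} showing the {organs}",
    "{organs} visible in a {orientation} {modality} slice of the {station}",
    "this {modality} image is a {orientation} slice through the {station} containing the {organs}",
    "radiology image, modality {modality}, orientation {orientation}, station {station}, organs {organs}",
    "the {station} in {orientation} {modality} with the {organs}",
    "{orientation} {modality} of the {station}; organs present: {organs}",
    "an image of the {organs} in the {station}, acquired with {modality} in {orientation} orientation",
    "{station} region, {modality}, {orientation} plane, showing the {organs}",
    "a photo of the {organs}, {station}, {orientation} {modality}",
]


def join_labels(labels):
    """Join organ names into a readable phrase, e.g. 'a, b, and c'."""
    if len(labels) == 1:
        return labels[0]
    if len(labels) == 2:
        return f"{labels[0]} and {labels[1]}"
    return ", ".join(labels[:-1]) + f", and {labels[-1]}"


def build_prompts(modality, orientation, station, organs):
    """Return one prompt per template for a single image's label set."""
    organ_text = join_labels(organs)
    return [
        template.format(
            modality=modality,
            orientation=orientation,
            station=station,
            organs=organ_text,
        )
        for template in TEMPLATES
    ]
```

For example, `build_prompts("CT", "axial", "abdomen", ["liver", "spleen"])` yields ten distinct captions for the same image, any of which could be paired with that image during fine-tuning.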