CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning

Yuexi Du, Brian Chang, Nicha C. Dvornek

TL;DR

This work tackles the data and compute challenges of applying CLIP-style language-image learning to medical imaging. It introduces CLEFT, a parameter-efficient framework that integrates a billion-parameter LLM via parameter-efficient fine-tuning (PEFT) and adds a context-based prompt-learning stage to improve cross-modal alignment while sharply reducing the number of trainable parameters. By treating CLIP training as knowledge distillation from the LLM to the vision encoder, CLEFT achieves state-of-the-art results on CheXpert-5x200, RSNA pneumonia, and EMBED mammography, while cutting total trainable parameters by 39% and shrinking the trainable language model to only 4% of the size of the current BERT encoder. The method offers a practical path for deploying powerful language models in medical imaging with lower training costs and better cross-domain generalization.

Abstract

Recent advancements in Contrastive Language-Image Pre-training (CLIP) have demonstrated notable success in self-supervised representation learning across various tasks. However, existing CLIP-like approaches often demand extensive GPU resources and prolonged training times due to the considerable size of the model and dataset, making them poorly suited to medical applications, where large datasets are not always available. Meanwhile, the language model prompts are mainly derived manually from labels tied to images, potentially overlooking the richness of information within training samples. We introduce a novel language-image Contrastive Learning method with an Efficient large language model and prompt Fine-Tuning (CLEFT) that harnesses the strengths of extensive pre-trained language and visual models. Furthermore, we present an efficient strategy for learning context-based prompts that mitigates the gap between informative clinical diagnostic data and simple class labels. Our method demonstrates state-of-the-art performance on multiple chest X-ray and mammography datasets compared with various baselines. The proposed parameter-efficient framework can reduce the total trainable model size by 39% and reduce the trainable language model to only 4% compared with the current BERT encoder.
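
As a rough illustration of the CLIP-style objective the abstract refers to, the following is a minimal NumPy sketch of a symmetric image-text contrastive loss. It is not taken from the paper: the function names, the temperature value of 0.07, and the use of plain NumPy arrays in place of real encoder outputs are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each
    is assumed to come from the same image-report pair.
    temperature: illustrative value, not the paper's setting.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (batch, batch)
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should pick out its own column.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should pick out its own row.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))  # stand-in for vision encoder outputs
    texts = rng.normal(size=(8, 16))   # stand-in for text encoder outputs
    print(clip_contrastive_loss(images, texts))
```

In a setup like the one described above, only a small set of adapter parameters on the language side would receive gradients from this loss; the sketch leaves the encoders out entirely and shows the objective alone.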

Paper Structure (22 sections, 1 equation, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Zero-shot Performance on CheXpert-5x200. We compare the performance of our method with multiple baselines on zero-shot CheXpert-5x200 classification. For each model, the diameter denotes the total number of trainable parameters, the color shows the number of trainable text encoder parameters, and the number reports accuracy. Our method outperforms all baselines with better parameter efficiency.
  • Figure 2: Proposed Method Framework. (a) Language-image contrastive learning with an LLM by utilizing PEFT. Fixed handcrafted prompts are used in this stage. (b) Prompt context learning with the pre-trained image and text encoder via classification.
  • Figure 3: Zero-shot Accuracy vs. Number of Trainable Prompt Tokens.
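
The prompt context learning stage named in Figure 2(b) can be pictured with a small sketch. Everything below is an assumption for illustration rather than the paper's implementation: the context tokens are taken to be shared across classes and prepended to the class-name token embeddings, the frozen text encoder is replaced by a hypothetical mean-pooling stand-in, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, n_ctx, n_classes = 32, 4, 5  # illustrative sizes only

# Hypothetical frozen class-name token embeddings (one token per class).
class_tokens = rng.normal(size=(n_classes, 1, embed_dim))
# Trainable prompt tokens; in this sketch they are the only parameters
# that would be updated during the prompt-learning stage.
context = rng.normal(scale=0.02, size=(n_ctx, embed_dim))

def build_prompts(context, class_tokens):
    # Prepend the same context tokens to every class prompt.
    n = class_tokens.shape[0]
    ctx = np.broadcast_to(context, (n,) + context.shape)
    return np.concatenate([ctx, class_tokens], axis=1)

def frozen_text_encoder(prompts):
    # Stand-in for a frozen pre-trained text encoder: mean pooling.
    return prompts.mean(axis=1)

def classify(image_emb, context, class_tokens):
    # Score each image against each class prompt by cosine similarity.
    text_emb = frozen_text_encoder(build_prompts(context, class_tokens))
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T  # (batch, n_classes)
    return logits.argmax(axis=1)

if __name__ == "__main__":
    images = rng.normal(size=(3, embed_dim))  # stand-in image embeddings
    print(build_prompts(context, class_tokens).shape)
    print(classify(images, context, class_tokens))
```

Under these assumptions, the number of trainable prompt tokens on the horizontal axis of Figure 3 would correspond to `n_ctx`.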