CLIP-IT: CLIP-based Pairing for Histology Images Classification

Banafsheh Karimian, Giulia Avanzato, Soufian Belharbi, Alexis Guichemerre, Luke McCaffrey, Mohammadhadi Shateri, Eric Granger

TL;DR

CLIP-IT introduces a practical framework for multimodal learning in histopathology by using unpaired external pathology reports as privileged information during training. It retrieves semantically relevant text per image with a CLIP-based retriever to create pseudo-image–text pairs, then distills text knowledge into the vision model via a two-branch distillation and late fusion mechanism, enabling efficient unimodal inference at test time. The approach achieves consistent accuracy gains over unimodal baselines and competitive performance versus fully paired multimodal methods across PCAM, BACH, and CRC, while incurring minimal inference overhead. This makes CLIP-IT a scalable, privacy-friendly alternative for data-scarce domains, with potential extensions to survival analysis, segmentation, and incorporation of additional modalities such as genomics.

Abstract

Multimodal learning has shown promise in medical imaging, combining complementary modalities like images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based inference, limiting their practicality due to annotation cost, privacy, and compute demands. Crucially, freely available unpaired external text, like pathology reports, can still provide complementary diagnostic cues if semantically relevant content is retrievable per image. To address this, we introduce CLIP-IT, a novel framework that relies on rich unpaired text reports. Specifically, CLIP-IT uses a CLIP model pre-trained on histology image-text pairs from a separate dataset to retrieve the most relevant unpaired textual report for each image in the downstream unimodal dataset. These reports, sourced from the same disease domain and tissue type, form pseudo-pairs that reflect shared clinical semantics rather than exact alignment. Knowledge from these texts is distilled into the vision model during training, while LoRA-based adaptation mitigates the semantic gap between unaligned modalities. At inference, only the vision model is used, keeping overhead low while still benefiting from multimodal training without requiring paired data in the downstream dataset. Experiments on histology image datasets show that CLIP-IT improves classification accuracy over both unimodal and multimodal CLIP-based baselines in most cases, without the burden of per-dataset paired annotation or inference-time complexity.
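
The retrieval step described in the abstract (pick, for each image, the most relevant unpaired report) can be illustrated with a minimal sketch. It assumes the image and report embeddings have already been produced by a pretrained CLIP-style model and that relevance is measured by cosine similarity; the function names `l2_normalize` and `pair_images_with_reports` and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def pair_images_with_reports(image_embeddings, report_embeddings):
    """Return, for each image, the index of the most similar report.

    Assumption: both embedding sets live in the same joint space,
    as they would when produced by one CLIP-style model.
    """
    reports = [l2_normalize(r) for r in report_embeddings]
    pairs = []
    for img in image_embeddings:
        img = l2_normalize(img)
        # Cosine similarity between this image and every report.
        sims = [sum(a * b for a, b in zip(img, rep)) for rep in reports]
        # Top-1 retrieval: keep the index of the highest similarity.
        best = max(range(len(sims)), key=sims.__getitem__)
        pairs.append(best)
    return pairs


# Toy embeddings (illustrative values only).
image_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
report_embeddings = [[0.1, 0.1, 1.0], [1.0, 0.0, 0.1], [0.0, 1.0, 0.0]]

pseudo_pairs = pair_images_with_reports(image_embeddings, report_embeddings)
print(pseudo_pairs)
```

Each entry of `pseudo_pairs` is the index of the report assigned to the corresponding image; several images may share one report, which is consistent with pseudo-pairs reflecting shared semantics rather than exact alignment.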

Paper Structure

This paper contains 26 sections, 5 equations, 9 figures, 7 tables, 3 algorithms.

Figures (9)

  • Figure 1: Overview of four learning approaches: (a) unimodal setting (image-only), (b) paired multimodal setting, requiring aligned image–text pairs at both training and inference, (c) prompt-based CLIP-style VLMs (e.g., CONCH), trained on paired data and needing text prompts at inference, and (d) the proposed CLIP-IT setting, which uses unpaired external reports for multimodal supervision during training but supports lightweight unimodal inference on the downstream dataset.
  • Figure 2: Illustration of our CLIP-IT method: (a) Image-Text Modality Pairing: each histology image is paired with the most semantically similar text report from an external unpaired dataset using a pretrained CLIP-based model; (b) CLIP-IT Multimodal Distillation: a joint model is trained using both vision and text encoders with a logit fusion mechanism and feature-level distillation; (c) Unimodal Inference: only the vision encoder and the learned projection modules are used, enabling a lightweight and unimodal prediction pipeline.
  • Figure 3: Pareto frontier plots showing the trade-off between model accuracy and parameter size across three histology datasets. Each point style represents a model configuration (Unimodal, CLIP-IT, or Multimodal), with color indicating the architecture. CLIP-IT consistently pushes unimodal models closer to or onto the frontier, offering an efficient alternative to heavier multimodal baselines.
  • Figure 4: Histogram of $\mathrm{\Omega}$ scores (Eq. \ref{eq:omega}) across datasets and backbones, showing the $\%$ of samples correctly classified by text but missed by vision, i.e., the complementary information carried by text. Bars denote models, with numeric values $(\mathrm{\Omega}\times100)$ above each. Higher scores indicate greater potential benefit from textual supervision.
  • Figure 5: Ablation study results showing the classification accuracy of various configurations on UNI and PCAM. The bars represent modifications, including pairing strategies (2nd–5th top), text corruption by k% word removal, early fusion, full fine-tuning, and component removals. The dashed line is the unimodal baseline.
  • ...and 4 more figures
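
The $\mathrm{\Omega}$ score in the Figure 4 caption can be sketched from the caption's wording alone. The paper's exact equation is not reproduced on this page, so the snippet below assumes $\mathrm{\Omega}$ is the fraction of all samples that the text branch classifies correctly while the vision branch does not; the function name `omega_score` and the toy predictions are hypothetical.

```python
def omega_score(labels, vision_preds, text_preds):
    """Fraction of samples the text model gets right and the vision model gets wrong.

    Assumption: the denominator is the total number of samples, following
    the caption's "% of samples"; the paper's equation may differ.
    """
    if not (len(labels) == len(vision_preds) == len(text_preds)):
        raise ValueError("all inputs must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    rescued = sum(
        1
        for y, v, t in zip(labels, vision_preds, text_preds)
        if t == y and v != y
    )
    return rescued / len(labels)


# Toy predictions (illustrative values only).
labels = [0, 1, 1, 0, 1]
vision_preds = [0, 0, 1, 1, 1]
text_preds = [0, 1, 0, 0, 0]

omega = omega_score(labels, vision_preds, text_preds)
print(f"Omega x 100 = {omega * 100:.1f}")
```

In this toy case two of the five samples are recovered by text only, so the value reported above a bar in the figure's convention would be 40.0.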