CLIP-IT: CLIP-based Pairing for Histology Images Classification

Banafsheh Karimian; Giulia Avanzato; Soufian Belharbi; Alexis Guichemerre; Luke McCaffrey; Mohammadhadi Shateri; Eric Granger

TL;DR

CLIP-IT introduces a practical framework for multimodal learning in histopathology by using unpaired external pathology reports as privileged information during training. It retrieves semantically relevant text per image with a CLIP-based retriever to create pseudo-image–text pairs, then distills text knowledge into the vision model via a two-branch distillation and late fusion mechanism, enabling efficient unimodal inference at test time. The approach achieves consistent accuracy gains over unimodal baselines and competitive performance versus fully paired multimodal methods across PCAM, BACH, and CRC, while incurring minimal inference overhead. This makes CLIP-IT a scalable, privacy-friendly alternative for data-scarce domains, with potential extensions to survival analysis, segmentation, and incorporation of additional modalities such as genomics.

Abstract

Multimodal learning has shown promise in medical imaging by combining complementary modalities such as images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based inference, which limits their practicality because of annotation cost, privacy constraints, and compute demands. Crucially, freely available unpaired external text, such as pathology reports, can still provide complementary diagnostic cues if semantically relevant content can be retrieved for each image. To address this, we introduce CLIP-IT, a novel framework that relies on rich unpaired text reports. Specifically, CLIP-IT uses a CLIP model pre-trained on histology image-text pairs from a separate dataset to retrieve the most relevant unpaired textual report for each image in the downstream unimodal dataset. These reports, sourced from the same disease domain and tissue type, form pseudo-pairs that reflect shared clinical semantics rather than exact alignment. Knowledge from these texts is distilled into the vision model during training, while LoRA-based adaptation mitigates the semantic gap between unaligned modalities. At inference, only the vision model is used, which keeps overhead low while still benefiting from multimodal training and requires no paired data in the downstream dataset. Experiments on histology image datasets confirm that CLIP-IT improves classification accuracy over both unimodal and multimodal CLIP-based baselines in most cases, without the burden of per-dataset paired annotation or added inference-time complexity.
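
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that image and report embeddings have already been computed by a CLIP-style encoder pair and that relevance is scored by cosine similarity, which is the usual CLIP convention but is an assumption here, not a detail confirmed by the text above. The function name `retrieve_pseudo_pairs`, the embedding sizes, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np


def retrieve_pseudo_pairs(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """For each image, return the index of the most similar report.

    image_emb: (n_images, d) array of image embeddings (hypothetical input).
    text_emb:  (n_texts, d) array of report embeddings (hypothetical input).
    Similarity is cosine similarity, an assumption of this sketch.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    similarity = img @ txt.T  # shape: (n_images, n_texts)
    # Top-1 retrieval: one pseudo-paired report per image.
    return similarity.argmax(axis=1)


# Stand-in embeddings: random vectors replace real CLIP encoder outputs.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 16))
# Each "image" embedding is a slightly perturbed copy of a known report,
# so the expected retrieval result is known by construction.
source_reports = [3, 0, 4]
image_emb = text_emb[source_reports] + 0.01 * rng.normal(size=(3, 16))

pairs = retrieve_pseudo_pairs(image_emb, text_emb)
print(pairs.tolist())
```

In this toy setup each image is matched back to the report it was derived from. With real data the retrieved report would only share clinical semantics with the image, which is why the abstract calls the result a pseudo-pair.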
