Towards a Visual-Language Foundation Model for Computational Pathology

Ming Y. Lu; Bowen Chen; Drew F. K. Williamson; Richard J. Chen; Ivy Liang; Tong Ding; Guillaume Jaume; Igor Odintsov; Andrew Zhang; Long Phi Le; Georg Gerber; Anil V Parwani; Faisal Mahmood

TL;DR

This work introduces CONtrastive learning from Captions for Histopathology (CONCH), a visual-language foundation model developed using diverse sources of histopathology images, biomedical text, and notably over 1.17 million image-caption pairs via task-agnostic pretraining, with the potential to directly facilitate a wide array of machine learning-based workflows requiring minimal or no further supervised fine-tuning.

Abstract

The accelerated adoption of digital pathology and advances in deep learning have enabled the development of powerful models for various pathology tasks across a diverse array of diseases and patient cohorts. However, model training is often difficult due to label scarcity in the medical domain and the model's usage is limited by the specific task and disease for which it is trained. Additionally, most models in histopathology leverage only image data, a stark contrast to how humans teach each other and reason about histopathologic entities. We introduce CONtrastive learning from Captions for Histopathology (CONCH), a visual-language foundation model developed using diverse sources of histopathology images, biomedical text, and notably over 1.17 million image-caption pairs via task-agnostic pretraining. Evaluated on a suite of 13 diverse benchmarks, CONCH can be transferred to a wide range of downstream tasks involving either or both histopathology images and text, achieving state-of-the-art performance on histology image classification, segmentation, captioning, text-to-image and image-to-text retrieval. CONCH represents a substantial leap over concurrent visual-language pretrained systems for histopathology, with the potential to directly facilitate a wide array of machine learning-based workflows requiring minimal or no further supervised fine-tuning.

Paper Structure (1 section, 3 equations, 15 figures, 38 tables)

This paper contains 1 section, 3 equations, 15 figures, 38 tables.

Figures (15)

Figure : Figure 1: Data curation and model schematic. a. Automated data cleaning pipeline. Educational sources (EDU) and parts of the PubMed Central Open Access Dataset (PMC OA) were manually cleaned and used to train an object detector to detect histopathology images, a language model to split captions referring to multiple images, and a matching model to match detected images to their corresponding captions. The cleaning process yields a dataset of 1.79 million image-text pairs, from which we filter out pairs referring to non-humans to create our CONCH (human-only) pretraining dataset of 1.17 million pairs. See Methods for details on data cleaning and Extended Data Figure 9 for performance comparisons using different variations of the pretraining dataset. b. Estimated distribution of image-text pairs in the human-only pretraining dataset by topic. Note that the pretraining data covers a diverse range of pathology topics. Inset compares the distribution of caption lengths between PMC-Path and EDU. See Extended Data Figure 1 for word clouds of captions from each category. c. Visual-language pretraining setup. CONCH consists of an image encoder, a text encoder, and a multimodal text decoder. The pretraining process uses both contrastive and captioning objectives. The contrastive objectives align the image and text encoders by maximizing the cosine-similarity scores between paired image and text embeddings, while the captioning objective maximizes the likelihood of generating the correct text conditioned on the image and previously generated text. See Methods for details. d. Radar plot comparing performance of CONCH and baselines on various downstream tasks. CONCH outperforms baselines by a significant margin on a diverse set of tasks spanning classification, retrieval, and segmentation. See Results for detailed descriptions of each task and metrics.
Figure : Figure 2: Zero-shot and supervised classification. a. Schematic of zero-shot classification using a pair of contrastively aligned image and text encoders. A prompt is constructed for each class, and the image is classified according to the prompt whose embedding is closest to that of the image in the shared embedding space. b. Zero-shot classification of WSIs. Each WSI is divided into tiles and processed as in a. The similarity scores for tiles are aggregated using top-$K$ pooling to form slide-level similarity scores, the highest of which corresponds to the slide-level prediction. In c and d, dashed lines represent the average over tasks. Error bars represent 95% confidence intervals. c. Zero-shot performance on downstream subtyping (TCGA BRCA, $n=150$; TCGA RCC, $n=225$; TCGA NSCLC, $n=150$; DHMC LUAD, $n=143$; CRC100k, $n=7,180$; WSSS4LUAD, $n=4,693$) and grading (SICAP, $n=2,122$) tasks. Cohen's $\kappa$ is reported for DHMC LUAD and quadratically weighted Cohen's $\kappa$ is reported for SICAP, while balanced accuracy is reported for all other tasks. Additional metrics are reported in Extended Data Tables 1-7. d. Supervised evaluation of embeddings of each model. Linear probing is used for ROI-level tasks (CRC100k and SICAP) while ABMIL is used for slide-level tasks, with the same metrics reported as in c. See Extended Data Tables 15-19 for more detailed results. e. From left to right: pathologist-annotated invasive ductal carcinoma (IDC), corresponding heatmap, and selected tiles at higher power. The heatmap is colored based on the cosine-similarity score between each tile within the slide and the text prompt corresponding to the predicted class label. We find excellent agreement between the annotated image and high-similarity regions, with the tiles demonstrating classic IDC morphology within the high-similarity regions and stroma or other normal constituents of the breast in the low-similarity regions.
Figure : Figure 3: Slide-level few-shot classification experiments. We investigate the label efficiency of different visual-language pretrained encoders in the few-shot setting where we vary the number of training labels per class ($n_{c}$), for $n_{c} = 1, 2, 4, 8, 16 \ldots$ until we reach the maximum number of available labels in the training set. For each $n_c$, we sample 5 different sets of training examples and train a weakly-supervised ABMIL model on each training set using slide-level labels (see Supervised classification experiments for details). We show their individual model performance via boxplot (i.e., $n=5$ for each box) to study the variance in model performance when performing supervised learning with very few training examples. Boxes indicate quartile values and whiskers extend to data points within 1.5$\times$ the interquartile range. For reference, the zero-shot performance of each model is shown as a dotted line on the same plot. In terms of few-shot supervised learning, CONCH achieves better performance (i.e. in terms of the median accuracy of 5 runs) than other encoders for different sizes of training set and for all tasks. Additionally, CONCH zero-shot performance is surprisingly competitive, outperforming PLIP, BiomedCLIP, and OpenAICLIP few-shot up to 64 labels per class in the case of BRCA and NSCLC subtyping.
Figure : Figure 4: Zero-shot Cross-Modal Retrieval. a. Model performance in cross-modal retrieval was evaluated on 3 datasets of image-text pairs (Source A, $n=797$; Source B, $n=1,755$; TCGA LUAD, $n=165$). Similarity in the embedding space is computed between the query image and all text samples in the database. The top-$K$ most similar texts are retrieved. We report Recall@$K$ for $K\in\{1,5,10\}$ as well as the Mean Recall, which averages over $K$. We show both text-to-image (top row) and image-to-text (bottom row) retrieval for each retrieval task (columns). The rightmost column reports the average across tasks for each metric. CONCH outperforms other baselines on all retrieval tasks. Error bars indicate 95% confidence intervals. b. Schematic for zero-shot image-to-text retrieval (text-to-image is analogous). c. Examples of images in top-5 retrieved results from TCGA LUAD using LUAD-relevant queries with cosine-similarity scores shown in top-right corner. Examples for other datasets using more diverse queries are shown in Extended Data Figure 7. In general, we find the images retrieved by the model match what is described in the text prompt.
...and 10 more figures
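
The captions for Figures 2 and 4 describe two procedures that operate purely on embeddings: slide-level zero-shot classification with top-$K$ pooling, and Recall@$K$ for cross-modal retrieval. The following is a minimal NumPy sketch of both, not the authors' implementation. It assumes embeddings have already been computed by some image and text encoder, that top-$K$ pooling means averaging the $K$ highest tile similarities per class, and that query $i$ is paired with database item $i$. All function names and parameters (`l2_normalize`, `zero_shot_slide_prediction`, `recall_at_k`, `top_k`) are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_slide_prediction(tile_embeddings, prompt_embeddings, top_k=5):
    """Classify one slide from its tile embeddings and one prompt embedding per class.

    Assumption: top-K pooling is taken to be the mean of the K largest
    tile-to-prompt cosine similarities for each class.
    """
    tiles = l2_normalize(tile_embeddings)      # (num_tiles, dim)
    prompts = l2_normalize(prompt_embeddings)  # (num_classes, dim)
    sims = tiles @ prompts.T                   # (num_tiles, num_classes)
    k = min(top_k, sims.shape[0])
    # Sort each class column in ascending order and keep the last k rows.
    top_scores = np.sort(sims, axis=0)[-k:, :]
    slide_scores = top_scores.mean(axis=0)     # (num_classes,)
    return int(np.argmax(slide_scores)), slide_scores


def recall_at_k(query_embeddings, database_embeddings, k):
    """Fraction of queries whose paired item appears among the top-k results.

    Assumption: query i is paired with database item i.
    """
    queries = l2_normalize(query_embeddings)
    database = l2_normalize(database_embeddings)
    sims = queries @ database.T                     # (num_queries, num_items)
    top_idx = np.argsort(-sims, axis=1)[:, :k]      # most similar first
    targets = np.arange(queries.shape[0])[:, None]
    hits = (top_idx == targets).any(axis=1)
    return float(hits.mean())
```

Under these assumptions, Mean Recall as described in Figure 4 would be the average of `recall_at_k` over $K \in \{1, 5, 10\}$. For image-to-text retrieval the image embeddings are the queries and the text embeddings the database; text-to-image swaps the two arguments.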