Molecular-driven Foundation Model for Oncologic Pathology

Anurag Vaidya, Andrew Zhang, Guillaume Jaume, Andrew H. Song, Tong Ding, Sophia J. Wagner, Ming Y. Lu, Paul Doucet, Harry Robertson, Cristina Almagro-Perez, Richard J. Chen, Dina ElHarouni, Georges Ayoub, Connor Bossi, Keith L. Ligon, Georg Gerber, Long Phi Le, Faisal Mahmood

TL;DR

Threads is a slide-level foundation encoder trained with multimodal contrastive learning to align whole-slide images (WSIs) with their corresponding molecular profiles, yielding general-purpose WSI embeddings. It is pretrained on MBTG-47k, a multimodal dataset of about 47k histology slides paired with RNA and DNA profiles, so that the embeddings capture both tissue morphology and molecular composition. Across 54 oncology tasks spanning 23 cohorts, Threads achieves state-of-the-art performance with strong generalization, label efficiency, and robustness in rare-event prediction, and it also supports retrieval and molecular prompting in zero-shot or few-shot settings. The authors state that they intend to release the model publicly.

Abstract

Foundation models are reshaping computational pathology by enabling transfer learning, where models pre-trained on vast datasets can be adapted for downstream diagnostic, prognostic, and therapeutic response tasks. Despite these advances, foundation models are still limited in their ability to encode entire gigapixel whole-slide images without additional training, and they often lack complementary multimodal data. Here, we introduce Threads, a slide-level foundation model capable of generating universal representations of whole-slide images of any size. Threads was pre-trained using a multimodal learning approach on a diverse cohort of 47,171 hematoxylin and eosin (H&E)-stained tissue sections paired with corresponding genomic and transcriptomic profiles, the largest such paired dataset used for foundation model development to date. This training paradigm enables Threads to capture the tissue's underlying molecular composition, yielding representations applicable to a wide array of downstream tasks. In extensive benchmarking across 54 oncology tasks, including clinical subtyping, grading, mutation prediction, immunohistochemistry status determination, treatment response prediction, and survival prediction, Threads outperformed all baselines while demonstrating strong generalizability and label efficiency. It is particularly well suited for predicting rare events, further emphasizing its clinical utility. We intend to make the model publicly available for the broader community.
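To make the idea of contrastive alignment between slide embeddings and molecular embeddings concrete, here is a minimal sketch in plain Python. The text above says only that Threads uses multimodal contrastive learning; the symmetric InfoNCE (CLIP-style) objective, the function name `symmetric_info_nce`, and the `temperature` value are illustrative assumptions, not the authors' confirmed loss.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (zero vectors are returned unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(slide_embs, mol_embs, temperature=0.1):
    """Hypothetical symmetric InfoNCE loss over a batch of paired embeddings.

    slide_embs[i] and mol_embs[i] are assumed to come from the same case;
    every other pairing in the batch acts as a negative.
    """
    s = [l2_normalize(v) for v in slide_embs]
    m = [l2_normalize(v) for v in mol_embs]
    n = len(s)
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(s[i], m[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Slide -> molecular direction: the matching profile is the target.
    loss_s2m = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Molecular -> slide direction: same targets on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_m2s = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_s2m + loss_m2s)

slides = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(slides, [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce(slides, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each slide embedding is closest to its own molecular embedding and large when the pairs are swapped, which is the property a contrastive pretraining objective of this kind relies on.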

Paper Structure

This paper contains 1 section, 5 equations, 11 figures, 59 tables.

Table of Contents

  1. Data splits.

Figures (11)

  • Figure 1: Study overview. a. Tissue site distribution of MBTG-47k used for Threads pretraining. b. 2-dimensional tSNE (van der Maaten and Hinton, 2008) representation of the Threads WSI embedding space on MBTG-47k, colored by primary organ. Each point is a WSI. c. Block diagram of the Threads architecture for WSI representation learning. d. Overview of the Threads downstream evaluation, composed of 54 tasks. Tasks are grouped into four families: clinical subtyping and grading (n=8 tasks), gene mutation prediction (n=21 tasks), immunohistochemistry status prediction (n=12 tasks), and treatment response and survival prediction (n=13 tasks). WSI: whole-slide image; tSNE: t-distributed stochastic neighbor embedding; MGH: Mass General Hospital; BWH: Brigham and Women's Hospital.
  • Figure 1: Detailed architecture of Threads. Threads employs a multimodal contrastive learning approach to align a whole-slide image representation with its corresponding molecular profile, obtained using either a DNA or an RNA assay. a. The vision encoding branch uses a multihead attention-based model to pool patch embeddings into a slide embedding. b. The RNA encoding branch uses an scGPT model pretrained on 5.7 million cells of various cancer types, which is fully fine-tuned to yield a transcriptome embedding. c. The DNA encoding branch uses a multilayer perceptron (MLP) to transform copy number variations (CNV), insertions and deletions (indels), and single nucleotide variants (SNV) into a genomic embedding. WSI: whole-slide image; ViT: vision transformer; concat.: concatenation; TPM: transcripts per million.
  • Figure 2: Evaluation of Threads and baselines with linear probing. a. Average performance of Threads and baselines on 54 tasks. Threads is compared against PRISM, GigaPath, and CHIEF. Average performance per family of tasks: b. clinical subtyping and grading (8 tasks), c. mutation prediction (21 tasks), d. IHC status prediction (12 tasks), and e. treatment response and survival prediction (13 tasks). f–k. Threads performance on treatment response and prognostication tasks characterized by label scarcity (n=36 to n=144 patients). Binary tasks (f–i) are measured with AUC. Survival tasks (j, k) are measured with the concordance index. f. Temozolomide treatment response in glioblastoma (GBM). g. Bevacizumab treatment response in ovarian cancer (OV). h. Neoadjuvant response assessment in invasive breast cancer (BRCA). i. Hormonal therapy response in prostate adenocarcinoma (PRAD). j. Overall survival (OS) prediction in pancreatic ductal adenocarcinoma (PDAC). k. Overall survival prediction in colon adenocarcinoma (COAD). l. Number of tasks where each model (Threads and baselines) reaches the highest performance across all tasks (n=54 tasks), treatment response tasks (n=6 tasks), and survival tasks (n=7 tasks). m. Few-shot learning performance of Threads against baselines in brain tumor subtyping. $k$ refers to the number of training samples per class. Error bars represent the standard error measured across multiple folds. Boxes indicate quartile values of model performance (n=5 runs), and whiskers extend to data points within 1.5-fold the interquartile range. Task-wise P-values were determined using two-sided Tukey Honestly Significant Difference tests accounting for multiple comparisons following a positive result (P<0.05) of a two-way ANOVA. Statistical significance across multiple tasks (e.g., for each family) was assessed using a mixed-effects model. P<0.05: *, P<0.01: **, P<0.001: ***.
  • Figure 2: Clustering capabilities of Threads. 2-dimensional tSNE representation of the CPTAC cohort stratified by cancer type (n=10 cancer types) using Threads (a.), PRISM (b.), GigaPath (c.), and CHIEF (d.). 2-dimensional tSNE representation of the EBRAINS cohort stratified by tumor type (n=12 tumor types) using Threads (e.), PRISM (f.), GigaPath (g.), and CHIEF (h.). ARI: adjusted Rand index; MI: mutual information; tSNE: t-distributed stochastic neighbor embedding.
  • Figure 3: Threads fine-tuning. a. Average performance of Threads and baselines fine-tuned on 54 benchmarking tasks, along with average performance for each family of tasks: b. clinical subtyping and grading (8 tasks), c. mutation prediction (21 tasks), d. IHC status prediction (12 tasks), and e. treatment response and survival prediction (13 tasks). Task-wise comparison of Threads and baselines fine-tuned on individual tasks: f. RAS status prediction in SURGEN colorectal adenocarcinoma (COAD). g. TP53 mutation prediction in CPTAC-COAD. h. PIK3CA mutation prediction in CPTAC breast invasive carcinoma (BRCA). i. Bevacizumab response prediction in ovarian cancer with fine-tuning. j. Temozolomide response prediction in MGB glioblastoma (GBM). k. Comparison of Threads fine-tuning vs. training a Threads model from scratch on our benchmark and families of tasks. l. Task-wise performance of Threads fine-tuning vs. Threads randomly initialized on ten representative tasks. Error bars represent standard error, and the centers correspond to the mean computed values of each metric. Task-wise P-values were determined using two-sided Tukey Honestly Significant Difference tests accounting for multiple comparisons following a positive result (P<0.05) of a two-way ANOVA. Statistical significance across multiple tasks (e.g., for each family) was assessed using a mixed-effects model. P<0.05: *, P<0.01: **, P<0.001: ***.
  • ...and 6 more figures
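
The architecture caption above describes a vision branch that uses multihead attention to pool patch embeddings into a single slide embedding. The sketch below illustrates that general idea in plain Python. The function name `attention_pool`, the use of one linear scoring vector per head, and the concatenation of head outputs are illustrative assumptions and are not taken from the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(patches, head_weights):
    """Pool N patch embeddings (each of length D) into one slide embedding.

    head_weights holds one hypothetical scoring vector (length D) per head.
    Each head computes a softmax-weighted average of the patches, and the
    head outputs are concatenated, giving length len(head_weights) * D
    regardless of how many patches the slide contains.
    """
    dim = len(patches[0])
    slide_embedding = []
    for w in head_weights:
        # One scalar relevance score per patch for this head.
        scores = [sum(wi * xi for wi, xi in zip(w, p)) for p in patches]
        attn = softmax(scores)
        # Attention-weighted average of the patch embeddings.
        pooled = [sum(a * p[d] for a, p in zip(attn, patches))
                  for d in range(dim)]
        slide_embedding.extend(pooled)
    return slide_embedding

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
heads = [[0.0, 0.0], [50.0, -50.0]]  # uniform head, then a selective head
print(len(attention_pool(patches, heads)))
```

Because the softmax weights sum to one over however many patches are supplied, the output size depends only on the number of heads and the patch embedding dimension, which is what lets a pooling model of this kind handle slides of any size.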