GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology

Saarthak Kapse, Pushpak Pati, Srikar Yellapragada, Srijan Das, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras, Prateek Prasanna

TL;DR

GECKO introduces a zero- or few-label pretraining framework for gigapixel histopathology slides that aligns WSI representations with a task-specific Concept Prior derived from interpretable pathology concepts. The model comprises two branches: a deep-encoding branch that aggregates patch features and a concept-encoding branch that aggregates concept priors, trained with a symmetric CLIP-style contrastive loss. This design yields accurate WSI-level embeddings while preserving interpretability through explicit concept activations, and it can seamlessly incorporate auxiliary modalities like transcriptomics when available. Across five tasks and multiple evaluation settings, GECKO achieves state-of-the-art or competitive performance, demonstrates strong generalization, and provides pathologist-friendly explanations of its predictions.
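The symmetric CLIP-style contrastive loss mentioned above can be illustrated with a minimal sketch. It assumes a batch in which row i of the deep embeddings and row i of the concept embeddings come from the same slide, and it uses plain Python for transparency rather than a tensor library. The function names and the temperature value of 0.07 are illustrative assumptions, not taken from the GECKO code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(deep, concept, temperature=0.07):
    """Average of the deep-to-concept and concept-to-deep cross-entropies.

    deep, concept: lists of B slide-level embeddings of equal dimension.
    temperature: hypothetical default; the paper's value may differ.
    """
    deep = [l2_normalize(v) for v in deep]
    concept = [l2_normalize(v) for v in concept]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(d, c)) / temperature for c in concept]
        for d in deep
    ]
    logits_t = [list(col) for col in zip(*logits)]  # transpose
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch of two slides: matched pairs should give a much lower loss
# than pairs whose concept embeddings have been swapped.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the other slides in the batch act as negatives for each slide, so the loss is small only when each deep embedding is most similar to its own concept embedding.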

Abstract

Pretraining a Multiple Instance Learning (MIL) aggregator enables the derivation of Whole Slide Image (WSI)-level embeddings from patch-level representations without supervision. While recent multimodal MIL pretraining approaches leveraging auxiliary modalities have demonstrated performance gains over unimodal WSI pretraining, the acquisition of these additional modalities necessitates extensive clinical profiling. This requirement increases costs and limits scalability in existing WSI datasets lacking such paired modalities. To address this, we propose Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO), which aligns WSIs with a Concept Prior derived from the available WSIs. First, we derive an inherently interpretable concept prior by computing the similarity between each WSI patch and textual descriptions of predefined pathology concepts. GECKO then employs a dual-branch MIL network: one branch aggregates patch embeddings into a WSI-level deep embedding, while the other aggregates the concept prior into a corresponding WSI-level concept embedding. Both aggregated embeddings are aligned using a contrastive objective, thereby pretraining the entire dual-branch MIL model. Moreover, when auxiliary modalities such as transcriptomics data are available, GECKO seamlessly integrates them. Across five diverse tasks, GECKO consistently outperforms prior unimodal and multimodal pretraining approaches while also delivering clinically meaningful interpretability that bridges the gap between computational models and pathology expertise. Code is made available at https://github.com/bmi-imaginelab/GECKO
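As a rough illustration of the concept prior described in the abstract, the sketch below computes the cosine similarity between each patch embedding and each concept text embedding, giving one row of concept activations per patch. It assumes both sets of embeddings already live in a shared vision-language space. The mean pooling at the end is only a stand-in for the concept branch's learned MIL aggregation, and all names and toy values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def concept_prior(patch_embeddings, concept_text_embeddings):
    """Return an N x K matrix: similarity of each of N patches to each of K concepts."""
    return [
        [cosine(patch, concept) for concept in concept_text_embeddings]
        for patch in patch_embeddings
    ]


def mean_pool(rows):
    """Stand-in aggregator (assumption): average the rows into one slide-level vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


# Toy slide with three patches and two concepts in a 2-D embedding space.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
concepts = [[1.0, 0.0], [0.0, 1.0]]

prior = concept_prior(patches, concepts)      # shape: 3 patches x 2 concepts
slide_concept_embedding = mean_pool(prior)    # shape: 2 concepts
```

Because every column of the prior corresponds to a named pathology concept, the per-patch rows and the pooled slide-level vector can both be read directly as concept activations, which is the source of the interpretability the abstract refers to.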

Paper Structure

This paper contains 20 sections, 6 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Unlike conventional unimodal or multimodal pretraining of WSI-level MIL aggregators, GECKO aligns a WSI with an interpretable Concept Prior derived from the WSI and task-relevant pathology concepts. Alongside downstream unsupervised and supervised performance benefits, GECKO provides WSI-level pathologist-friendly interpretable descriptors.
  • Figure 2: Overview of GECKO: (a) We start by extracting an interpretable, task-relevant Concept Prior from a WSI using a [...] and a [...]. (b) Next, we pretrain a dual-branch MIL network by contrastively aligning WSI-level deep and concept embeddings. (c) These embeddings can be used for supervised learning via linear probing. (d) Additionally, the concept embedding can be directly used for unsupervised learning using a pathologist-driven heuristic.
  • Figure 3: Few-labels (in-domain) classification analysis. The AUC results are obtained through linear probing. Dashed lines indicate pretraining using only WSI data, while solid lines represent multimodal pretraining with additional transcriptomics data. CONCH is utilized for extracting deep features for image patches.
  • Figure 4: Effect of false negative elimination keep ratio ($r_{keep}$). All AUC values are reported in an unsupervised setting (5-fold cross-validation) using our proposed heuristic. $r_{keep}=0.7$ was found to work consistently well across all tasks.
  • Figure 5: Few-labels (in-domain) classification analysis. All AUC results are with linear probing. Dashed lines represent pretraining on WSI only, and solid lines represent multimodal pretraining with gene data. CONCH is used for extracting deep features.
  • ...and 2 more figures