Learning biologically relevant features in a pathology foundation model using sparse autoencoders

Nhat Minh Le; Ciyue Shen; Neel Patel; Chintan Shah; Darpan Sanghavi; Blake Martin; Alfred Eng; Daniel Shenker; Harshith Padigela; Raymond Biju; Syed Ashar Javed; Jennifer Hipp; John Abel; Harsha Pokkalla; Sean Grullon; Dinkar Juyal

TL;DR

We address interpretability of pathology foundation-model embeddings by applying sparse autoencoders to 384‑dim CLS embeddings from PLUTO, training across layers 1–12 with the loss $L = \frac{1}{k} \sum_{i=1}^{k} \|x_i - \hat{x}_i\|_2^2 + \lambda \sum_{i=1}^{k} \|f_i\|_1$ to obtain monosemantic feature dimensions. SAE dimensions correlate with counts of key cell types (plasma cells, lymphocytes, cancer cells, fibroblasts, macrophages), and monosemanticity improves in deeper layers, with representations robust to scanner and stain variation and generalizable to CPTAC (out-of-domain) data. Cross-model universality is demonstrated by high cross-model correlations (e.g., plasma-cell concepts ρ≈0.96; anthracotic macrophages ρ≈0.91) between independently trained SAEs, and clustering reveals interpretable feature dictionaries tied to histological concepts. Overall, the work shows that pathology foundation-model embeddings can yield biologically grounded, generalizable, and interpretable representations that can support downstream clinical tasks and mechanistic investigations in pathology.
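
As a rough illustration of the loss above, the following NumPy sketch evaluates it for one batch of embeddings. The single-layer ReLU encoder, the linear decoder, the hidden width of 3072 (borrowed from the Figure 2 caption), the value of λ, and all function names are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Assumed architecture: ReLU encoder followed by a linear decoder."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # sparse feature activations f_i
    x_hat = f @ W_dec + b_dec                # reconstructed embeddings
    return f, x_hat

def sae_loss(x, x_hat, f, lam):
    """L = (1/k) * sum_i ||x_i - x_hat_i||_2^2 + lam * sum_i ||f_i||_1."""
    k = x.shape[0]                           # number of embeddings in the batch
    reconstruction = np.sum((x - x_hat) ** 2) / k
    sparsity = lam * np.sum(np.abs(f))       # not divided by k, as in the formula
    return reconstruction + sparsity

# Synthetic stand-in for a batch of 384-dim CLS embeddings (not real PLUTO data).
rng = np.random.default_rng(0)
k, d_in, d_hidden = 8, 384, 3072             # d_hidden is an assumption
x = rng.normal(size=(k, d_in))
W_enc = rng.normal(scale=0.01, size=(d_in, d_hidden))
W_dec = rng.normal(scale=0.01, size=(d_hidden, d_in))
b_enc = np.zeros(d_hidden)
b_dec = np.zeros(d_in)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f, lam=1e-3)       # lam value is hypothetical
```

Increasing `lam` puts more weight on the L1 term, which pushes more entries of `f` to zero at the cost of reconstruction error.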

Abstract

Pathology plays an important role in disease diagnosis, treatment decision-making and drug development. Previous work on interpretability for machine learning models on pathology images has revolved around methods such as attention value visualization and deriving human-interpretable features from model heatmaps. Mechanistic interpretability is an emerging area of model interpretability that focuses on reverse-engineering neural networks. Sparse Autoencoders (SAEs) have emerged as a promising direction for extracting monosemantic features from polysemantic model activations. In this work, we trained a Sparse Autoencoder on the embeddings of a pathology pretrained foundation model. We found that Sparse Autoencoder features represent interpretable and monosemantic biological concepts. In particular, individual SAE dimensions showed strong correlations with cell type counts such as plasma cells and lymphocytes. These biological representations were unique to the pathology pretrained model and were not found in a self-supervised model pretrained on natural images. We demonstrated that such biologically grounded monosemantic representations evolved across the model's depth, and the pathology foundation model eventually gained robustness to non-biological factors such as scanner type. The emergence of biologically relevant SAE features generalized to an out-of-domain dataset. Our work paves the way for further exploration of interpretable feature dimensions and their utility for medical and clinical applications.

Paper Structure (23 sections, 11 figures, 1 table)

Introduction
Mechanistic Interpretability
Interpretability in Pathology
Summary of Contributions
Method
Datasets
Embedding extraction
Biological and color feature extraction
Sparse autoencoder training
Training a sparse autoencoder on PLUTO embeddings reveals interpretable features
Interpretability analysis of PLUTO embeddings
PLUTO SAE dimensions represent interpretable pathology-relevant concepts
Comparison to non-pathology ViT model
Evaluation of SAE monosemanticity using pathology-relevant cellular features
Emergence of monosemanticity across PLUTO layers
...and 8 more sections

Figures (11)

Figure 1: Feature visualization of SAE hidden dimensions revealed an interpretable dictionary of pathology features. For each SAE hidden dimension of model A (trained on the TCGA dataset) and model B (trained on the 1M dataset), 4 of the top 16 images that activated that dimension were visualized. Manual examination revealed interpretable features represented by these dimensions. For model A, these include cell and tissue features specific to H&E stain (top row: poorly differentiated carcinoma with distinct cell separation, red blood cells, mucin); geometric features (middle row: edge of tissue, clefting between cancer and stroma, diagonal fibers); and staining and artifact features (bottom row: blur, sectioning artifact, red stain). For model B, some SAE dimensions were specific to H&E stain (first column: collagen-enriched fibroblasts, circular clusters of tumor cells, surgical ink), some were specific to IHC stain (second column: stained lymphocytes, edge of tissue, blur), and others generalized across stains (third column: large cancer cells, vertical structures, tissue folds).
Figure 2: UMAP of 3072 SAE dimensions from the model trained on the 1M dataset. Feature clusters were identified by HDBSCAN and interpreted by manual inspection. Several clusters clearly associated with histological concepts are highlighted. For cancer and immune cell clusters, the top 3 patches that maximally activate the SAE dimension are shown.
Figure 3: Pearson correlations of SAE dimensions of PLUTO and DINO models with counts of pathology-relevant cell types, showing much higher correlations of the PLUTO SAE dimensions with the cell count features.
Figure 4: SAE-1736 monosemantically encoded plasma cell-specific information. Top panels show the average cell counts across bins of (A) SAE-1736 activation values and (B) PLUTO dimension 148 values. Average plasma cell counts (shown in purple) increased linearly with increasing SAE-1736 activation values, while counts of other cell types decreased or remained constant. In contrast, counts of lymphocytes, macrophages, and plasma cells all increased monotonically with increasing PLUTO-148 feature values. (C) Correlation between SAE-1736 activation and counts of five cell types, showing monosemantic correlation with only the plasma cell counts. (D) Same as (C), but for PLUTO dimension 148.
Figure 5: Monosemanticity emerged in later layers of PLUTO. (A) Correlation of cell count features with dimensions of SAE models trained on the embeddings of PLUTO across layers. In each layer, the five SAE dimensions with the highest correlation with the counts of each of the five cell classes were plotted. (B) Correlation of color features with dimensions of SAE models trained on the embeddings of different layers of PLUTO. (C) For each layer, we found the SAE dimension with the highest correlation with the count of plasma cells. We then measured the monosemanticity of that dimension by calculating its correlation with the other cell type counts. (D) Entropy of the best plasma-cell SAE dimension for each layer with respect to the five cell types (lower entropy implies higher monosemanticity). The red dotted line represents the maximum possible entropy ($S_{max} = 0.70$).
...and 6 more figures
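
The correlation and entropy measures described in the captions of Figures 3 to 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the entropy is assumed to be the Shannon entropy of the absolute correlations normalized to sum to 1, with a base-10 logarithm, chosen because its maximum over five cell types is log10(5) ≈ 0.70, which matches the $S_{max}$ quoted for Figure 5. Function names and the synthetic data are hypothetical.

```python
import numpy as np

def pearson_with_counts(activation, counts):
    """Pearson r between one SAE dimension (n,) and each cell-type count column (n, c)."""
    a = activation - activation.mean()
    c = counts - counts.mean(axis=0)
    denom = np.sqrt(np.sum(a ** 2)) * np.sqrt(np.sum(c ** 2, axis=0))
    return (a @ c) / denom

def correlation_entropy(r):
    """Assumed monosemanticity score: base-10 entropy of normalized |r|.

    Lower values mean the correlation mass sits on fewer cell types.
    """
    p = np.abs(np.asarray(r, dtype=float))
    p = p / p.sum()
    p = p[p > 0]                              # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log10(p)))

# Synthetic example: one activation vector and counts for five cell types.
rng = np.random.default_rng(0)
n_patches, n_cell_types = 500, 5
counts = rng.poisson(lam=5.0, size=(n_patches, n_cell_types)).astype(float)
# Hypothetical SAE dimension that tracks only the first cell type, plus noise.
activation = counts[:, 0] + rng.normal(scale=0.5, size=n_patches)

r = pearson_with_counts(activation, counts)
entropy = correlation_entropy(r)
max_entropy = np.log10(n_cell_types)          # upper bound for five classes
```

Under this definition, a dimension that correlates equally with all five cell types reaches the maximum, and a dimension that correlates with exactly one cell type scores 0.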
