Learning biologically relevant features in a pathology foundation model using sparse autoencoders
Nhat Minh Le, Ciyue Shen, Neel Patel, Chintan Shah, Darpan Sanghavi, Blake Martin, Alfred Eng, Daniel Shenker, Harshith Padigela, Raymond Biju, Syed Ashar Javed, Jennifer Hipp, John Abel, Harsha Pokkalla, Sean Grullon, Dinkar Juyal
TL;DR
We address the interpretability of pathology foundation-model embeddings by applying sparse autoencoders (SAEs) to 384-dimensional CLS embeddings from PLUTO. SAEs are trained on embeddings from layers 1–12 with the loss $L = \frac{1}{k} \sum_{i=1}^{k} \|x_i - \hat{x}_i\|_2^2 + \lambda \sum_{i=1}^{k} \|f_i\|_1$ to obtain monosemantic feature dimensions. SAE dimensions correlate with counts of key cell types (plasma cells, lymphocytes, cancer cells, fibroblasts, macrophages). Monosemanticity improves in deeper layers, and the representations are robust to scanner and stain variation and generalize to out-of-domain CPTAC data. Independently trained SAEs learn highly correlated features (e.g., plasma-cell concepts ρ≈0.96; anthracotic macrophages ρ≈0.91), which supports cross-model universality, and clustering reveals interpretable feature dictionaries tied to histological concepts. Overall, the work shows that pathology foundation-model embeddings can yield biologically grounded, generalizable, and interpretable representations that can support downstream clinical tasks and mechanistic investigations in pathology.
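The loss above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single-hidden-layer SAE with a ReLU encoder and a linear decoder, and it reads $x_i$ as an input embedding, $\hat{x}_i$ as its reconstruction, $f_i$ as the vector of SAE feature activations, $k$ as the number of embeddings in a batch, and $\lambda$ as the sparsity weight. The hidden width, batch size, and value of $\lambda$ are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Assumed architecture: ReLU encoder followed by a linear decoder."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # feature activations f_i (non-negative)
    x_hat = f @ W_dec + b_dec               # reconstructions x_hat_i
    return f, x_hat

def sae_loss(x, x_hat, f, lam):
    """L = (1/k) * sum_i ||x_i - x_hat_i||_2^2 + lam * sum_i ||f_i||_1."""
    k = x.shape[0]
    reconstruction = np.sum((x - x_hat) ** 2) / k
    sparsity = lam * np.sum(np.abs(f))
    return reconstruction + sparsity

# Illustrative shapes: 384 matches the embedding size in the text;
# the batch size and hidden width are assumptions.
rng = np.random.default_rng(0)
k, d_in, d_hidden = 8, 384, 1536
x = rng.normal(size=(k, d_in))              # stand-in for CLS embeddings
W_enc = rng.normal(scale=0.02, size=(d_in, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.02, size=(d_hidden, d_in))
b_dec = np.zeros(d_in)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f, lam=1e-3)
```

In this sketch the L1 term pushes most entries of each $f_i$ toward zero, which is what lets individual dimensions specialize to a single concept.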
Abstract
Pathology plays an important role in disease diagnosis, treatment decision-making and drug development. Previous work on interpretability for machine learning models on pathology images has revolved around methods such as attention value visualization and deriving human-interpretable features from model heatmaps. Mechanistic interpretability is an emerging area of model interpretability that focuses on reverse-engineering neural networks. Sparse Autoencoders (SAEs) have emerged as a promising direction for extracting monosemantic features from polysemantic model activations. In this work, we trained a Sparse Autoencoder on the embeddings of a pathology-pretrained foundation model. We found that Sparse Autoencoder features represent interpretable and monosemantic biological concepts. In particular, individual SAE dimensions showed strong correlations with cell type counts, such as those of plasma cells and lymphocytes. These biological representations were unique to the pathology-pretrained model and were not found in a self-supervised model pretrained on natural images. We demonstrated that such biologically grounded monosemantic representations evolved across the model's depth, and that the pathology foundation model eventually gained robustness to non-biological factors such as scanner type. The emergence of biologically relevant SAE features generalized to an out-of-domain dataset. Our work paves the way for further exploration of interpretable feature dimensions and their utility for medical and clinical applications.
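As an illustration of the correlation analysis mentioned in the abstract, the following sketch computes the Pearson correlation between each SAE dimension and a per-image cell count. It is a hedged example under stated assumptions, not the authors' pipeline: the data are synthetic, the names `per_dimension_correlation` and `plasma_counts` are hypothetical, and the use of Pearson rather than another correlation measure is an assumption.

```python
import numpy as np

def per_dimension_correlation(features, counts):
    """Pearson correlation between each feature column and a count vector."""
    f_centered = features - features.mean(axis=0)
    c_centered = counts - counts.mean()
    cov = f_centered.T @ c_centered
    denom = np.sqrt((f_centered ** 2).sum(axis=0) * (c_centered ** 2).sum())
    # Dimensions that never vary have zero variance; report 0 for them.
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

# Synthetic stand-ins: 200 images, 16 SAE dimensions (both sizes are assumptions).
rng = np.random.default_rng(0)
n_tiles, n_dims = 200, 16
features = rng.random((n_tiles, n_dims))
features[:, 5] = 0.0  # a "dead" dimension that never activates

# Synthetic counts constructed to track dimension 3, plus a little noise.
plasma_counts = 10 * features[:, 3] + rng.normal(scale=0.1, size=n_tiles)

rho = per_dimension_correlation(features, plasma_counts)
best = int(np.argmax(np.abs(rho)))  # dimension most correlated with the counts
```

Ranking dimensions by the absolute value of `rho` gives a simple way to shortlist candidate features for a given cell type before inspecting them visually.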
