PathAlign: A vision-language model for whole slide images in histopathology

Faruk Ahmed; Andrew Sellergren; Lin Yang; Shawn Xu; Boris Babenko; Abbi Ward; Niels Olson; Arash Mohtashamian; Yossi Matias; Greg S. Corrado; Quang Duong; Dale R. Webster; Shravya Shetty; Daniel Golden; Yun Liu; David F. Steiner; Ellery Wulczyn

TL;DR

PathAlign tackles slide-level image-text alignment for gigapixel histopathology whole slide images (WSIs) by pairing WSIs with diagnostic text from pathology reports in a BLIP-2-based architecture. It combines a frozen pathology-specific patch encoder (PathSSL) with a frozen LLM to support both cross-modal retrieval (PathAlign-R) and text generation (PathAlign-G), and is trained on a large, real-world dataset (DS1) supplemented with TCGA data. The approach yields strong retrieval performance (top-1 accuracy 73.5%, top-3 accuracy 91.3%) and high-quality generated text (78% rated 4 or 5 by pathologists), along with competitive WSI classification across several diagnostic tasks. The work demonstrates the potential of language-aligned WSI embeddings for automatic report generation and AI-assisted workflows, including case prioritization, while acknowledging limitations in slide-to-text mapping and the need for broader generalization across data sources.

Abstract

Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.
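To make the idea of retrieval in a shared image-text embedding space concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the 3-dimensional vectors are invented toy values, the function names are hypothetical, and the use of cosine similarity is an assumption made for illustration. In the paper, whether a retrieved text counts as accurate is decided by pathologist ratings; the `acceptable` sets below merely stand in for that judgment.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def retrieve_top_k(query_embedding, candidate_embeddings, k=3):
    """Return indices of the k candidates most similar to the query."""
    scores = [cosine_similarity(query_embedding, c) for c in candidate_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def top_k_accuracy(retrievals, acceptable):
    """Fraction of queries with at least one acceptable index among its retrievals.

    `acceptable` is a list of sets, one per query (an assumption standing in
    for human ratings of which retrieved texts are accurate).
    """
    hits = sum(
        1 for ranked, ok in zip(retrievals, acceptable)
        if any(i in ok for i in ranked)
    )
    return hits / len(retrievals)

# Toy, made-up embeddings: two "WSI" queries and three candidate "texts".
wsi_embeddings = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]
text_embeddings = [[1.0, 0.0, 0.0], [0.1, 0.9, 0.1], [0.0, 0.1, 1.0]]

# Image-to-text retrieval: for each WSI embedding, find the closest text.
retrievals = [retrieve_top_k(w, text_embeddings, k=1) for w in wsi_embeddings]
```

The same functions cover text-to-image retrieval by swapping the roles of the query and candidate lists, which is the practical benefit of placing both modalities in one embedding space.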

Paper Structure (36 sections, 10 figures, 12 tables)

Introduction
Related work
Methods
Data
Curating image-text pairs
Data splits
Modeling
Patch sampling
Patch encoder
WSI encoder
Image-text alignment
Evaluation
Text retrieval and generation
WSI classification
Results
...and 21 more sections

Figures (10)

Figure 1: Model overview. PathAlign provides aligned WSI and text embeddings, enabling embedding-based cross-modal retrieval and WSI classification. The WSI encoder is further aligned with a frozen large language model (LLM), enabling applications such as text generation and visual question answering. The model is trained largely following the BLIP-2 approach (see the Modeling section for details), making use of a frozen patch-level, histopathology-specialized embedding model (PathSSL) and a frozen LLM.
Figure 2: WSI-text association types in real-world data. We associate each WSI with part-level text from the original report. Due to the part, block, slide hierarchy and variability in accessioning, there are three high-level categories of association between slides and part-level text. The probability that some of the information in the part-level text does not apply to a given slide increases from category 1 to category 3.
Figure 3: Pathologist evaluation of image-to-text retrievals and generated text. (a) For embedding-based retrievals, top-K accuracy is shown, using a rating of 4 or 5 to indicate accurate text without clinically significant errors or omissions. Original refers to pathologist evaluation of the original diagnostic text. (b) Per-WSI comparison of ratings for the generated text and the original diagnostic text. Ratings for which both AI and original text received a score of 3 or lower are excluded in this plot: 21 ratings excluded in total (9%) with 3 from normal (4% of normal), 11 from mild (11% of mild), and 7 from significant (13% of significant). The score-based definitions of these categories are provided in a Supplemental table. The mild category includes a range of findings such as inflammation, benign conditions, and adenomas. The significant category includes carcinoma, dysplasia, and findings with direct implications for clinical management.
Figure B.1: Overview of pathology case accessioning. Pathology specimens are typically processed and accessioned by case, part, block, and slide. A single case may have several different parts and a single part may have several different blocks, with each block sectioned (i.e. cut) to provide one or more slides for histopathology review.
Figure B.2: Example part-level diagnostic text from DS1. An example of final diagnosis text from a pathology report for a colorectal biopsy case, with reports split by part. Information that may not be determinable from the images (e.g. sample location, tumor size) is removed on a best-effort basis via regular expressions, with example removals indicated by strikethrough text in this figure.
...and 5 more figures
