PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation

Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner

TL;DR

This work tackles the challenge of generating pathology reports from multiple WSIs by leveraging a long-context multimodal model, Gemini 1.5 Flash, fine-tuned with LoRA to produce part-level findings from thousands of image patches at 10X magnification. PolyPath integrates patches across all slides in a part (up to 50 slides) and is trained to predict both tissue labels and diagnostic findings, with a baseline single-slide approach used for comparison. Expert pathologists evaluate PolyPath against baselines and original reports, finding that PolyPath matches or exceeds the original in about 68% of up-to-5-slide cases, and consistently outperforms single-slide methods on NLG metrics, though performance declines as the number of slides increases due to data sparsity. The study demonstrates the feasibility and promise of long-context LMMs for multi-slide clinical reporting while recognizing limitations in generalizability and suggesting directions for case-level modeling and improved spatial integration of patches.

Abstract

The interpretation of histopathology cases underlies many important diagnostic and treatment decisions in medicine. Notably, this process typically requires pathologists to integrate and summarize findings across multiple slides per case. Existing vision-language capabilities in computational pathology have so far been largely limited to small regions of interest, larger regions at low magnification, or single whole-slide images (WSIs). This limits interpretation of findings that span multiple high-magnification regions across multiple WSIs. By making use of Gemini 1.5 Flash, a large multimodal model (LMM) with a 1-million token context window, we demonstrate the ability to generate bottom-line diagnoses from up to 40,000 768x768 pixel image patches from multiple WSIs at 10X magnification. This is the equivalent of up to 11 hours of video at 1 fps. Expert pathologist evaluations demonstrate that the generated report text is clinically accurate and equivalent to or preferred over the original reporting for 68% (95% CI: [60%, 76%]) of multi-slide examples with up to 5 slides. While performance decreased for examples with 6 or more slides, this study demonstrates the promise of leveraging the long-context capabilities of modern LMMs for the uniquely challenging task of medical report generation where each case can contain thousands of image patches.
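
The video comparison in the abstract follows from simple arithmetic: treating each image patch as one frame played at 1 fps, 40,000 patches correspond to 40,000 seconds of footage. The short sketch below reproduces that conversion; the function name `video_hours` is illustrative and does not come from the paper.

```python
def video_hours(num_frames: int, fps: float = 1.0) -> float:
    """Convert a frame count into hours of video at the given frame rate."""
    seconds = num_frames / fps
    return seconds / 3600.0


# Figures taken from the abstract: up to 40,000 patches, viewed at 1 fps.
max_patches = 40_000
hours = video_hours(max_patches, fps=1.0)
print(f"{max_patches} patches at 1 fps is about {hours:.1f} hours of video")
```

The result is slightly above 11 hours, consistent with the "up to 11 hours" figure quoted above.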

Paper Structure

This paper contains 12 sections, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Model overview. PolyPath is an LMM based on Gemini 1.5 Flash that takes in tissue-containing image patches from multiple slides for a part as well as the part-level specimen label, generating part-level report text across a wide range of tissue sites, procedures, and pathologies. The flow above reflects usage at inference time; during training, the model is trained to predict both the label and finding sections.
  • Figure 2: Pathologist evaluation. Human evaluations conducted with two board-certified pathologists for 104 selected examples taken from the P2-5 subset of the test set. The scoring rubric is described in the Supplemental rubric table. (a) Average pathologist scores (higher is better) for report text generated by our three models (SS-Random, SS-LLM, PolyPath) and the original bottom-line part-level diagnosis. Confidence intervals were generated via bootstrap resampling at the part level with 10,000 replicates. (b) Comparison of ratings for report text generated by PolyPath vs. the original report text for all parts as well as for subsets of parts with normal, mild, and significant findings. For this plot, we exclude 3 examples for which at least one pathologist rated both the PolyPath generated text and the original text at a score of 3 or lower.
  • Figure B.1: Pathologist evaluations per individual rater, corresponding to Figure 2. Plots for the P2-5 test split are shown for (a) average rating for generated report text from the SS-Random, SS-LLM, and PolyPath models as well as the original report text, and (b) preference plots for generated report text from PolyPath vs. the original report texts.
  • Figure B.2: Preference plots for single-slide models. Comparison of ratings for report text generated by (a) SS-Random and (b) SS-LLM vs. the original report text. Similar to the analysis in Figure 2, we exclude examples where both the model output and the original text receive a score of 3 or lower from at least one pathologist.
  • Figure B.3: Distribution of ratings. Overall distribution of ratings per pathologist over the P2-5 test set.
  • ...and 8 more figures
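
The Figure 2 caption mentions confidence intervals obtained by bootstrap resampling at the part level with 10,000 replicates. A minimal sketch of one way to do this is given below. It assumes a percentile bootstrap (the caption does not say which variant was used), and the function name `bootstrap_ci` and the toy scores are hypothetical, not data from the paper.

```python
import random
from statistics import mean


def bootstrap_ci(part_scores, n_replicates=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean score, resampling whole parts.

    part_scores: list with one entry per part; each entry is the list of
    rater scores for that part (e.g. two pathologists -> two scores).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    # Collapse each part to a single value so that a part is the resampling unit.
    per_part = [mean(scores) for scores in part_scores]
    n = len(per_part)
    replicate_means = []
    for _ in range(n_replicates):
        sample = rng.choices(per_part, k=n)  # draw n parts with replacement
        replicate_means.append(sum(sample) / n)
    replicate_means.sort()
    lower = replicate_means[int((alpha / 2) * n_replicates)]
    upper = replicate_means[int((1 - alpha / 2) * n_replicates) - 1]
    return mean(per_part), lower, upper


# Hypothetical ratings for six parts, two raters each (NOT data from the paper).
toy_scores = [[5, 4], [3, 4], [5, 5], [2, 3], [4, 4], [5, 3]]
estimate, low, high = bootstrap_ci(toy_scores)
print(f"mean={estimate:.2f}, 95% CI=[{low:.2f}, {high:.2f}]")
```

Resampling parts rather than individual ratings keeps both raters' scores for a part together, so the interval reflects variation across parts.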