Cancer Type, Stage and Prognosis Assessment from Pathology Reports using LLMs

Rachit Saluja, Jacob Rosenthal, Yoav Artzi, David J. Pisapia, Benjamin L. Liechty, Mert R. Sabuncu

TL;DR

This study benchmarks large language models on extracting cancer type, AJCC stage, and prognosis from unstructured pathology reports derived from TCGA. LLMs perform well at cancer type identification but find staging and prognosis harder. Instruction tuning yields substantial gains, enabling smaller open-source models to approach or surpass larger closed models on several tasks. Path-llama3.1-8B and Path-GPT-4o-mini-FT emerge as the top performers, with strong zero-shot results and efficient deployment supported by LoRA-based tuning and JSON-formatted outputs. The work points to practical directions for clinical deployment, including privacy-preserving, resource-efficient models and possible improvements from retrieval-augmented generation and multimodal data integration.

Abstract

Large Language Models (LLMs) have shown significant promise across various natural language processing tasks. However, their application in pathology, particularly for extracting meaningful insights from unstructured medical texts such as pathology reports, remains underexplored and poorly quantified. In this project, we evaluate state-of-the-art language models, including the GPT family, Mistral models, and the open-source Llama models, on comprehensive analysis of pathology reports. Specifically, we assess their performance in cancer type identification, AJCC stage determination, and prognosis assessment, which together span information extraction and higher-order reasoning. Based on a detailed analysis of their zero-shot performance metrics, we developed two instruction-tuned models: Path-llama3.1-8B and Path-GPT-4o-mini-FT. These models outperformed the other models evaluated in zero-shot cancer type identification, staging, and prognosis assessment.
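As a rough illustration of how JSON-formatted model outputs could be scored on the three tasks, the sketch below parses one response per report and computes accuracy and macro-averaged F1. It is not the authors' evaluation code: the field names (`cancer_type`, `ajcc_stage`, `prognosis`), the helper names, and the example labels are all assumptions made for illustration.

```python
import json

# Hypothetical field names; the actual output schema is not specified here.
FIELDS = ("cancer_type", "ajcc_stage", "prognosis")


def parse_prediction(raw: str) -> dict:
    """Parse one model response; malformed or missing fields become None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        obj = {}
    if not isinstance(obj, dict):
        obj = {}
    return {field: obj.get(field) for field in FIELDS}


def accuracy(y_true, y_pred):
    """Fraction of reports whose predicted label matches the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes in the reference."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative responses and reference labels (made up for this sketch).
raw_outputs = [
    '{"cancer_type": "BRCA", "ajcc_stage": "Stage II", "prognosis": "True"}',
    '{"cancer_type": "LUAD", "ajcc_stage": "Stage I"}',
    "not json",  # an unparseable response counts as a wrong prediction
]
gold_types = ["BRCA", "LUAD", "BRCA"]

pred_types = [parse_prediction(r)["cancer_type"] for r in raw_outputs]
type_accuracy = accuracy(gold_types, pred_types)
type_macro_f1 = macro_f1(gold_types, pred_types)
```

Treating unparseable responses as incorrect, rather than dropping them, keeps the denominator fixed across models, which is one reasonable choice when comparing systems that differ in how reliably they follow the output format.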

Paper Structure

This paper contains 14 sections, 18 figures.

Figures (18)

  • Figure 1: (A) Heatmap illustrating cancer identification accuracy across all diseases. (B) Cancer identification accuracy metrics. (C) Cancer identification F1-score metrics.
  • Figure 2: (A) Error rates of identifying cancer types. (B) Error rates of identifying AJCC stage. (C) Common mistakes made by LLMs. (D) Training loss curves for the instruction-tuned models, plotted only for the first 6,000 steps, although GPT-4o-mini was instruction-tuned for 17,000 steps. (E) Distribution of disease-specific survival time in years.
  • Figure 3: (A) Heatmap illustrating AJCC stage identification accuracy across all diseases. (B) AJCC stage identification accuracy metrics. (C) AJCC stage identification F1-score metrics.
  • Figure 4: Example of GPT-4o's chain-of-thought output compared with the standard AJCC guidelines for skin cutaneous melanoma.
  • Figure 5: (A) Heatmap illustrating prognosis assessment accuracy across all diseases. (B) Prognosis assessment accuracy metrics. (C) Prognosis assessment F1-score metrics.
  • ...and 13 more figures