Cancer Type, Stage and Prognosis Assessment from Pathology Reports using LLMs
Rachit Saluja, Jacob Rosenthal, Yoav Artzi, David J. Pisapia, Benjamin L. Liechty, Mert R. Sabuncu
TL;DR
This study benchmarks large language models on extracting cancer type, AJCC stage, and prognosis from unstructured pathology reports using TCGA-derived data. It shows that LLMs excel at cancer type identification but find staging and prognosis harder; instruction tuning yields substantial gains, enabling smaller open-source models to approach or surpass larger closed models on several tasks. Path-llama3.1-8B and Path-GPT-4o-mini-FT emerge as the top performers, with strong zero-shot results and efficient deployment supported by LoRA fine-tuning and JSON-formatted outputs. The work highlights practical directions for clinical deployment, including privacy-preserving, resource-efficient models and possible enhancements through retrieval-augmented generation and multimodal data integration. Together, the findings support scalable, privacy-conscious, AI-assisted analysis of pathology reports in oncology workflows.
Abstract
Large Language Models (LLMs) have shown significant promise across a range of natural language processing tasks. However, their application in pathology, particularly for extracting meaningful insights from unstructured medical texts such as pathology reports, remains underexplored and poorly quantified. In this project, we evaluate state-of-the-art language models, including the GPT family, Mistral models, and the open-source Llama models, on the comprehensive analysis of pathology reports. Specifically, we assess their performance in cancer type identification, AJCC stage determination, and prognosis assessment, tasks that span both information extraction and higher-order reasoning. Based on a detailed analysis of their zero-shot performance, we developed two instruction-tuned models: Path-llama3.1-8B and Path-GPT-4o-mini-FT. These models outperformed the other models evaluated in zero-shot cancer type identification, staging, and prognosis assessment.
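The zero-shot setup described above, in which a model reads a report and returns cancer type, AJCC stage, and prognosis as structured output, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration rather than the authors' implementation: the prompt wording, the JSON key names (`cancer_type`, `ajcc_stage`, `prognosis`), the four-level stage vocabulary, and the helper names `build_prompt` and `parse_response` are all hypothetical. The sketch makes no model call; it only shows how a prompt might be assembled and how a JSON reply might be validated.

```python
import json

# Hypothetical label set; the exact stage vocabulary used in the study
# is not specified in this section.
STAGES = ("Stage I", "Stage II", "Stage III", "Stage IV")

# Hypothetical output keys, one per task named in the abstract.
REQUIRED_KEYS = ("cancer_type", "ajcc_stage", "prognosis")

# Illustrative prompt wording only; not the prompt used by the authors.
PROMPT_TEMPLATE = (
    "You are assisting with pathology report analysis.\n"
    "Read the report below and respond with a single JSON object with the keys "
    '"cancer_type", "ajcc_stage", and "prognosis".\n'
    "The value of ajcc_stage must be one of: {stages}.\n\n"
    "Report:\n{report}\n"
)


def build_prompt(report_text: str) -> str:
    """Assemble a zero-shot prompt (no worked examples) for one report."""
    return PROMPT_TEMPLATE.format(
        stages=", ".join(STAGES),
        report=report_text.strip(),
    )


def parse_response(raw: str) -> dict:
    """Extract the JSON object from a model reply and validate its fields."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys in model output: {missing}")
    if data["ajcc_stage"] not in STAGES:
        raise ValueError(f"unexpected stage label: {data['ajcc_stage']!r}")
    # Keep only the expected keys so downstream scoring sees a fixed schema.
    return {key: data[key] for key in REQUIRED_KEYS}
```

Requesting a fixed JSON schema and rejecting replies that do not conform keeps scoring mechanical: each field can be compared directly against the reference label without free-text matching.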
