Classifying Cancer Stage with Open-Source Clinical Large Language Models

Chia-Hsuan Chang; Mary M. Lucas; Grace Lu-Yao; Christopher C. Yang

TL;DR

The paper tackles the challenge of extracting pTNM cancer staging from unstructured pathology reports without labeled data by evaluating open-source clinical LLMs (Llama-2-70b-chat, ClinicalCamel-70B, Med42-70B) against a fine-tuned Clinical-BigBird baseline on TCGA pathology reports. It explores zero-shot, zero-shot chain-of-thought, and few-shot prompting, finding that prompting can enable LLMs to achieve competitive macro F1 scores for N and M, with T remaining best captured by the fine-tuned baseline. The study highlights the sensitivity of LLM performance to prompt design and institutional text variation, and demonstrates that locally hosted models can extract clinically relevant staging information without additional training data. These results support broader use of prompt engineering with open-source LLMs for clinical NLP tasks, while underscoring the need for cancer-type-specific prompts and robust evaluation frameworks.

Abstract

Cancer stage classification is important for making treatment and care management plans for oncology patients. Information on staging is often included in unstructured form in clinical, pathology, radiology and other free-text reports in the electronic health record system, requiring extensive work to parse and obtain. To facilitate the extraction of this information, previous NLP approaches rely on labeled training datasets, which are labor-intensive to prepare. In this study, we demonstrate that without any labeled training data, open-source clinical large language models (LLMs) can extract pathologic tumor-node-metastasis (pTNM) staging information from real-world pathology reports. Our experiments compare LLMs and a BERT-based model fine-tuned using the labeled data. Our findings suggest that while LLMs still exhibit subpar performance in Tumor (T) classification, with the appropriate adoption of prompting strategies, they can achieve comparable performance on Metastasis (M) classification and improved performance on Node (N) classification.
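
The comparison summarized here is reported in terms of macro F1. As a minimal sketch of that metric, the Python below computes the unweighted mean of per-class F1 scores. The function name `macro_f1`, the example N-stage labels, and the choice to average over every label that appears in either the gold or predicted list are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1.

    Assumption: the class set is every label seen in y_true or y_pred.
    A class with no true positives contributes an F1 of 0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("at least one example is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    per_class = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical N-stage labels, for illustration only.
gold = ["N0", "N0", "N1", "N1", "N2"]
pred = ["N0", "N1", "N1", "N1", "N2"]
print(round(macro_f1(gold, pred), 4))
```

Because every class carries equal weight, a rare stage that is classified poorly lowers the score as much as a common one, which is why macro averaging suits imbalanced staging labels.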

Paper Structure (15 sections, 4 equations, 7 tables)

Introduction
Literature Review
Materials and Methods
Dataset
Benchmark
Large Language Model
Prompting strategy
Performance Metric
Result
Performance Comparison: Benchmark and LLMs with Zero-shot Prompting
Performance Comparison: LLMs with Different Prompting Strategies
Performance Comparison by Cancer Type
Discussion
Strengths and Limitations
Conclusion
