A Large Language Model Pipeline for Breast Cancer Oncology

Tristen Pool; Dennis Trujillo

TL;DR

This work demonstrates that fine-tuning large language models on a Duke MRI-derived clinical dataset and a clinical guidelines corpus, via a LangChain-based pipeline, can achieve high accuracy ($>0.85$) in predicting adjuvant radiation therapy and chemotherapy for breast cancer. The approach uses GPT-3.5-Turbo for context retention, Babbage for efficient binary classification, and Davinci for guideline text generation, supported by a data-preprocessing and Q&A generation workflow. An error-analysis framework using a Wilson score interval suggests that, once human error is accounted for, the model would need to outperform the original oncologist in roughly $8.2\%$ to $13.3\%$ of cases to be the better solution overall; whether it meets that threshold requires validation through clinical studies. The findings indicate potential to broaden access to quality oncology care in community settings, while underscoring the need for clinical trials to confirm real-world effectiveness and safety.

Abstract

Large language models (LLMs) have demonstrated potential to drive innovation across many disciplines. However, how best to develop them for oncology remains underexplored. State-of-the-art OpenAI models were fine-tuned on a clinical dataset and a clinical guidelines text corpus for two important cancer treatment factors, adjuvant radiation therapy and chemotherapy, using a novel LangChain prompt-engineering pipeline. High accuracy (0.85+) was achieved in the classification of adjuvant radiation therapy and chemotherapy for breast cancer patients. Furthermore, a confidence interval was formed from observational data on the quality of treatment by human oncologists to estimate the proportion of scenarios in which the model must outperform the original oncologist's treatment decision to be the better solution overall: 8.2% to 13.3%. Because the outcomes of cancer treatment decisions are indeterminate, further investigation, potentially a clinical trial, would be required to determine whether the models meet this threshold. Nevertheless, with 85% of U.S. cancer patients receiving treatment at local community facilities, such models could play an important part in expanding access to quality care with outcomes that are, at minimum, close to those of a human oncologist.
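
The TL;DR names a Wilson score interval as the basis of the error analysis. As a minimal sketch of that general technique only, the following computes a Wilson interval for a binomial proportion. The function name `wilson_interval`, the 95% critical value `z = 1.96`, and the counts in the usage lines are assumptions chosen for illustration; they are not the authors' code or data.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    successes: number of observed events (hypothetical input)
    n:         number of trials
    z:         normal critical value (1.96 is assumed here for ~95% coverage)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")

    p_hat = successes / n
    denom = 1 + z**2 / n
    # Centre of the interval is pulled from p_hat toward 0.5.
    centre = (p_hat + z**2 / (2 * n)) / denom
    # Half-width of the interval around that centre.
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Placeholder counts for illustration only; not taken from the paper.
low, high = wilson_interval(successes=10, n=100)
print(f"Wilson interval: {low:.3f} to {high:.3f}")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly for small samples or proportions near 0 or 1, which is a common reason to prefer it for error-rate estimates.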

Paper Structure (15 sections, 3 equations, 5 figures, 1 table)

Introduction
Datasets
Duke MRI
Clinical Guideline Corpus
Methods
GPT Models
GPT-3.5 Turbo
Babbage
Davinci
LangChain
Duke Pipeline
Temperature Sensitivity Analysis
Clinical Guidelines Pipeline
Error Analysis
Discussion

Figures (5)

Figure 1: The overall architecture diagram. A. Clinical corpus text data, B. Simple text preprocessing, C. LangChain agent handles the decision making, D. The text is then segregated into Q&A pairs, E. Summarization step where non-useful Q&A pairs are discarded, F. GPT-3 Davinci model trained on the Q&A pairs, G. The trained model, H. Inference done by doctors, patients, and healthcare professionals.
Figure 2: Validation accuracy of the Babbage models fine-tuned for adjuvant radiation therapy classification and adjuvant chemotherapy classification, both trained for five epochs, plotted against model temperature.
Figure 3: Validation confusion matrices for Babbage adjuvant radiation therapy classification (n=181).
Figure 4: Validation confusion matrices for Babbage adjuvant chemotherapy classification (n=171).
Figure 5: Davinci chat model training loss (left) and training token accuracy (right).
