Table of Contents
Fetching ...

Evaluating The Performance of Using Large Language Models to Automate Summarization of CT Simulation Orders in Radiation Oncology

Meiyun Cao, Shaw Hu, Jason Sharp, Edward Clouser, Jason Holmes, Linda L. Lam, Xiaoning Ding, Diego Santos Toesca, Wendy S. Lindholm, Samir H. Patel, Sujay A. Vora, Peilong Wang, Wei Liu

TL;DR

This study evaluates a locally hosted Llama 3.1 405B to automate the summarization of CT simulation orders in radiation oncology. By creating category-specific prompts and a therapist-verified ground truth from 607 matched orders, the authors assess AI-generated summaries against expert standards. The model achieves an average accuracy of 98.59% across seven categories, with high consistency at a temperature of 0.1 and three prompt trials, though some category-specific complexities and ambiguities affect performance. The findings suggest that LLM-assisted summarization can improve consistency and reduce clinician workload while preserving privacy, supporting integration into the CT simulation workflow.

Abstract

Purpose: This study aims to use a large language model (LLM) to automate the generation of summaries from the CT simulation orders and evaluate its performance. Materials and Methods: A total of 607 CT simulation orders for patients were collected from the Aria database at our institution. A locally hosted Llama 3.1 405B model, accessed via the Application Programming Interface (API) service, was used to extract keywords from the CT simulation orders and generate summaries. The downloaded CT simulation orders were categorized into seven groups based on treatment modalities and disease sites. For each group, a customized instruction prompt was developed collaboratively with therapists to guide the Llama 3.1 405B model in generating summaries. The ground truth for the corresponding summaries was manually derived by carefully reviewing each CT simulation order and subsequently verified by therapists. The accuracy of the LLM-generated summaries was evaluated by therapists using the verified ground truth as a reference. Results: About 98% of the LLM-generated summaries aligned with the manually generated ground truth in terms of accuracy. Our evaluations showed an improved consistency in format and enhanced readability of the LLM-generated summaries compared to the corresponding therapists-generated summaries. This automated approach demonstrated a consistent performance across all groups, regardless of modality or disease site. Conclusions: This study demonstrated the high precision and consistency of the Llama 3.1 405B model in extracting keywords and summarizing CT simulation orders, suggesting that LLMs have great potential to help with this task, reduce the workload of therapists and improve workflow efficiency.

Evaluating The Performance of Using Large Language Models to Automate Summarization of CT Simulation Orders in Radiation Oncology

TL;DR

This study evaluates a locally hosted Llama 3.1 405B to automate the summarization of CT simulation orders in radiation oncology. By creating category-specific prompts and a therapist-verified ground truth from 607 matched orders, the authors assess AI-generated summaries against expert standards. The model achieves an average accuracy of 98.59% across seven categories, with high consistency at a temperature of 0.1 and three prompt trials, though some category-specific complexities and ambiguities affect performance. The findings suggest that LLM-assisted summarization can improve consistency and reduce clinician workload while preserving privacy, supporting integration into the CT simulation workflow.

Abstract

Purpose: This study aims to use a large language model (LLM) to automate the generation of summaries from the CT simulation orders and evaluate its performance. Materials and Methods: A total of 607 CT simulation orders for patients were collected from the Aria database at our institution. A locally hosted Llama 3.1 405B model, accessed via the Application Programming Interface (API) service, was used to extract keywords from the CT simulation orders and generate summaries. The downloaded CT simulation orders were categorized into seven groups based on treatment modalities and disease sites. For each group, a customized instruction prompt was developed collaboratively with therapists to guide the Llama 3.1 405B model in generating summaries. The ground truth for the corresponding summaries was manually derived by carefully reviewing each CT simulation order and subsequently verified by therapists. The accuracy of the LLM-generated summaries was evaluated by therapists using the verified ground truth as a reference. Results: About 98% of the LLM-generated summaries aligned with the manually generated ground truth in terms of accuracy. Our evaluations showed an improved consistency in format and enhanced readability of the LLM-generated summaries compared to the corresponding therapists-generated summaries. This automated approach demonstrated a consistent performance across all groups, regardless of modality or disease site. Conclusions: This study demonstrated the high precision and consistency of the Llama 3.1 405B model in extracting keywords and summarizing CT simulation orders, suggesting that LLMs have great potential to help with this task, reduce the workload of therapists and improve workflow efficiency.

Paper Structure

This paper contains 14 sections, 4 figures.

Figures (4)

  • Figure 1: Pre-processing the dataset by matching the exam dates. The raw dataset is processed by matching the exam date of the CT simulation orders with the date of the corresponding therapist-wrote notes, retaining only matched CT simulation orders and discarding unmatched CT simulation orders.
  • Figure 2: Categorization and integration of the dataset. The workflow demonstrates how data is systematically categorized by treatment modalities (such as proton or photon therapies) and disease sites, then data is categorized into 7 groups, ensuring data quality and consistency for analysis.
  • Figure 3: Prompt engineering and evaluation process. The AI output generated from the customized prompt undergoes continuous evaluation until it meets the initial evaluation standards. During this process, the prompt is iteratively refined after each failed evaluation.
  • Figure 4: The accuracy of the AI generated summaries across 7 treatment categories. Each color in the circular figure represents a specific category, with the corresponding accuracy of the AI-generated summaries for that category shown in the same color.