Iterative Prompt Refinement for Radiation Oncology Symptom Extraction Using Teacher-Student Large Language Models

Reza Khanmohammadi, Ahmed I Ghanem, Kyle Verdecchia, Ryan Hall, Mohamed Elshaikh, Benjamin Movsas, Hassan Bagher-Ebadian, Indrin Chetty, Mohammad M. Ghassemi, Kundan Thind

TL;DR

The paper presents a novel teacher-student framework where Mixtral (student) initially extracts radiotherapy-related symptoms from prostate cancer clinical notes, and GPT-4 (teacher) iteratively refines prompts across rounds and epochs to enhance performance. Evaluated on 294 single-symptom notes across 12 toxicity symptoms and 375 multi-symptom notes for validation, the method achieves notable improvements in accuracy, precision, recall, and F1 after refinement, with the strongest gains in single-symptom cases. The approach highlights automatic prompt engineering, zero-shot learning capabilities, and privacy advantages from local inference, while acknowledging limitations such as potential overfitting due to limited optimization data and the need for broader validation. Overall, the study demonstrates a promising, data-efficient pathway for improving clinical NLP tasks in radiation oncology through an automated teacher-student prompting regime with potential clinical impact.

Abstract

This study introduces a novel teacher-student architecture utilizing Large Language Models (LLMs) to improve prostate cancer radiotherapy symptom extraction from clinical notes. Mixtral, the student model, initially extracts symptoms, followed by GPT-4, the teacher model, which refines prompts based on Mixtral's performance. This iterative process involved 294 single-symptom clinical notes across 12 symptoms, with up to 16 rounds of refinement per epoch. Results showed significant improvements in extracting symptoms from both single-symptom and multi-symptom notes. For 59 single-symptom notes, accuracy increased from 0.51 to 0.71, precision from 0.52 to 0.82, recall from 0.52 to 0.72, and F1 score from 0.49 to 0.73. In 375 multi-symptom notes, accuracy rose from 0.24 to 0.43, precision from 0.60 to 0.76, recall from 0.24 to 0.43, and F1 score from 0.20 to 0.44. These results demonstrate the effectiveness of advanced prompt engineering in LLMs for radiation oncology use.
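
The refinement loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`refine_prompt`, `toy_student`, `toy_teacher`), the keyword-matching stand-ins for Mixtral and GPT-4, the toy notes, and the rule of keeping a refined prompt only when accuracy improves are all assumptions made for illustration. Only the overall shape (the student extracts, the teacher rewrites the prompt from the student's errors, up to 16 rounds) comes from the text above.

```python
from typing import Callable, List, Tuple

Note = Tuple[str, str]             # (note text, gold symptom label)
Error = Tuple[str, str, str]       # (note text, gold label, student prediction)
Student = Callable[[str, str], str]
Teacher = Callable[[str, List[Error]], str]


def accuracy(prompt: str, notes: List[Note], student: Student) -> float:
    """Fraction of notes for which the student's extraction matches the gold label."""
    correct = sum(1 for text, label in notes if student(prompt, text) == label)
    return correct / len(notes)


def refine_prompt(initial_prompt: str, notes: List[Note], student: Student,
                  teacher: Teacher, max_rounds: int = 16) -> Tuple[str, float]:
    """Run up to max_rounds of teacher-driven prompt refinement.

    Assumption: a candidate prompt is kept only if it scores strictly higher.
    """
    best_prompt = initial_prompt
    best_score = accuracy(best_prompt, notes, student)
    for _ in range(max_rounds):
        if best_score == 1.0:
            break
        # Collect the student's mistakes under the current best prompt.
        errors = [(text, label, student(best_prompt, text)) for text, label in notes]
        errors = [e for e in errors if e[1] != e[2]]
        # The teacher proposes a new prompt based on those mistakes.
        candidate = teacher(best_prompt, errors)
        score = accuracy(candidate, notes, student)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score


def toy_student(prompt: str, text: str) -> str:
    """Hypothetical stand-in for Mixtral: finds a symptom only if the prompt names it."""
    for symptom in ("nocturia", "dysuria", "fatigue"):
        if symptom in prompt and symptom in text:
            return symptom
    return "none"


def toy_teacher(prompt: str, errors: List[Error]) -> str:
    """Hypothetical stand-in for GPT-4: adds the first missed symptom to the prompt."""
    if not errors:
        return prompt
    return prompt + " " + errors[0][1]


notes = [
    ("patient reports nocturia", "nocturia"),
    ("mild dysuria today", "dysuria"),
    ("fatigue after treatment", "fatigue"),
]
best_prompt, best_score = refine_prompt("Extract symptoms:", notes, toy_student, toy_teacher)
print(best_prompt, best_score)
```

In a real setting the two stand-ins would be replaced by calls to a locally hosted student model and a teacher model, and the score would be measured on held-out notes rather than on the notes used for refinement, which is the overfitting concern the TL;DR raises.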

Paper Structure (5 sections, 2 figures, 1 table, 1 algorithm)

This paper contains 5 sections, 2 figures, 1 table, 1 algorithm.

Figures (2)

  • Figure 1: Schematic representation of our proposed method.
  • Figure 2: The evolution of Mixtral's symptom extraction accuracy for each symptom across epochs.