Hybrid Student-Teacher Large Language Model Refinement for Cancer Toxicity Symptom Extraction

Reza Khanmohammadi; Ahmed I. Ghanem; Kyle Verdecchia; Ryan Hall; Mohamed Elshaikh; Benjamin Movsas; Hassan Bagher-Ebadian; Bing Luo; Indrin J. Chetty; Tuka Alhanai; Kundan Thind; Mohammad M. Ghassemi

TL;DR

Teacher-guided iterative refinement (prompt refinement, RAG, and fine-tuning) enhances the capabilities of compact LLMs for cancer toxicity symptom extraction, offering a balance between performance, cost-effectiveness, and privacy preservation in healthcare settings.

Abstract

Large Language Models (LLMs) offer significant potential for clinical symptom extraction, but their deployment in healthcare settings is constrained by privacy concerns, computational limitations, and operational costs. This study investigates the optimization of compact LLMs for cancer toxicity symptom extraction using a novel iterative refinement approach. We employ a student-teacher architecture, with Zephyr-7b-beta and Phi3-mini-128 as student models and GPT-4o as the teacher, in which the teacher dynamically selects between prompt refinement, Retrieval-Augmented Generation (RAG), and fine-tuning strategies. Our experiments on 294 clinical notes covering 12 post-radiotherapy toxicity symptoms demonstrate the effectiveness of this approach. The RAG method proved most efficient, improving average accuracy from 0.32 to 0.73 for Zephyr-7b-beta and from 0.40 to 0.87 for Phi3-mini-128 during refinement. On the test set, both models showed an increase of approximately 0.20 in accuracy across symptoms. Notably, this improvement was achieved at a cost 45 times lower than GPT-4o for Zephyr-7b-beta and 79 times lower for Phi3-mini-128. These results highlight the potential of iterative refinement techniques to enhance the capabilities of compact LLMs for clinical applications, offering a balance between performance, cost-effectiveness, and privacy preservation in healthcare settings.

Paper Structure (21 sections, 3 figures)

Introduction
Experiments
Results
Conclusion and Discussion

Figures (3)

Figure 1: The iterative refinement method, involving a student model (Phi3-mini-128 or Zephyr-7b-beta) and a teacher model (GPT-4o). The student model receives clinical notes and a target symptom and generates initial labels and reasoning. The teacher model then assesses the student's performance and decides between prompt refinement and fine-tuning. In prompt refinement, the teacher improves the prompt and adds RAG examples. In fine-tuning, the teacher selects relevant samples and sets hyperparameters for the student model. In the hybrid approach, the teacher acts as an intelligent agent, choosing dynamically between the two based on the student's performance and needs. The chosen refinement is applied iteratively to optimize the student model's symptom extraction performance.
Figure 2: Performance comparison of the Zephyr and Phi3 models in symptom extraction. The line charts on the left show the evolution of accuracy for each post-RT toxicity symptom, with one color-coded line for each of the 12 symptoms listed in the legend. The line chart on the right shows the average score across all toxicity symptoms at each time point, illustrating the difference in mean and standard deviation between the initial student model and the final refined model.
Figure 3: The left panels display initial (brighter colors) and refined (darker colors) performance scores (blue and green bars) and associated costs (red bars) for the Phi-3 and Zephyr models across different refinement techniques: Hybrid, Finetuned, RAG, and GPT-4o. Hatched bars represent F1-macro scores, while smooth bars indicate accuracy. Averages and standard deviations are calculated across 12 toxicity symptoms. The right panel illustrates the average Performance-Cost Ratio for refined Phi-3 and Zephyr models, showing performance scores and associated costs.
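
The loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `StudentState`, `Decision`, `refine`, `toy_student`, and `toy_teacher`, the action labels, and the stopping threshold are all hypothetical, and the student and teacher are keyword-matching stand-ins rather than Phi3-mini-128, Zephyr-7b-beta, or GPT-4o. The three example notes are invented for the demonstration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Note = Tuple[str, int]  # (note text, gold label; 1 = symptom present)


@dataclass
class StudentState:
    """Everything the teacher may change about the student (hypothetical)."""
    prompt: str
    rag_examples: List[Note] = field(default_factory=list)
    finetune_rounds: int = 0


@dataclass
class Decision:
    """Teacher output: which strategy to apply, plus its payload."""
    action: str  # "prompt_rag" or "finetune" (assumed labels)
    new_prompt: str = ""
    examples: List[Note] = field(default_factory=list)


def refine(predict, decide, state, notes, max_iters=5, target=0.9):
    """Score the student, let the teacher pick a strategy, apply it, repeat."""
    history = []
    for _ in range(max_iters):
        errors = [(t, y) for t, y in notes if predict(state, t) != y]
        score = (len(notes) - len(errors)) / len(notes)  # accuracy
        history.append(score)
        if score >= target:  # assumed stopping rule
            break
        decision = decide(state, score, errors)
        if decision.action == "prompt_rag":
            state.prompt = decision.new_prompt or state.prompt
            state.rag_examples.extend(decision.examples)
        elif decision.action == "finetune":
            state.finetune_rounds += 1  # placeholder for a real training run
        else:
            raise ValueError(f"unknown action: {decision.action}")
    return history


def toy_student(state, text):
    """Stand-in student: flags a note if it shares a word with a positive example."""
    keywords = {w for ex, y in state.rag_examples if y == 1 for w in ex.lower().split()}
    return int(any(w in keywords for w in text.lower().split()))


def toy_teacher(state, score, errors):
    """Stand-in teacher: always chooses prompt refinement and adds one failed note."""
    return Decision(action="prompt_rag", examples=[errors[0]])


notes = [
    ("patient reports nausea", 1),
    ("no complaints today", 0),
    ("severe nausea after RT", 1),
]
state = StudentState(prompt="Does the note mention nausea?")
history = refine(toy_student, toy_teacher, state, notes)
print(history)
```

In this toy run the student starts with no examples, so it labels every note negative; after the teacher adds one misclassified note as a retrieval example, accuracy on the three notes rises. A real setup would score the refined student on a held-out test set rather than on the notes used during refinement.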
