Benchmarking LLMs and SLMs for patient reported outcomes

Matteo Marengo, Jarod Lévy, Jean-Emmanuel Bibault

TL;DR

This study benchmarks locally deployable small language models (SLMs) against a large language model (GPT-4) for summarizing patient-reported outcomes (PROs) in radiotherapy using PRO-CTCAE data. It introduces a keyword-guided, two-sentence summarization prompt and evaluates the models with custom metrics: a severity score $S = \frac{K_s}{K_p}$, recall $R = \frac{K_s}{K_p + K_{fn}}$, and Cohen's kappa $\kappa = \frac{P_o - P_e}{1 - P_e}$, plus a GPT-4 grade score. Results show that GPT-4 consistently outperforms the SLMs (mean severity ~0.83, recall ~0.56, kappa ~0.34), while the SLMs exhibit higher recall but poorer agreement and greater variability, suggesting a gap of about 25% in overall performance. The authors discuss the trade-offs between privacy, efficiency, and accuracy, and propose hybrid or fine-tuned approaches to enable safer, scalable, privacy-preserving AI support in clinical workflows.
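The three formulas in the TL;DR can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The TL;DR does not define the symbols, so the readings used here are assumptions: $K_s$ is taken as the number of reference keywords found in the summary, $K_p$ as the number of keywords in the patient form, and $K_{fn}$ as the number of false negatives. Cohen's kappa follows its standard definition, with $P_o$ the observed agreement between two label sequences and $P_e$ the agreement expected by chance. All function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def severity_score(k_s: int, k_p: int) -> float:
    """S = K_s / K_p (symbol meanings are assumed, see the lead-in)."""
    if k_p <= 0:
        raise ValueError("k_p must be positive")
    return k_s / k_p


def recall_score(k_s: int, k_p: int, k_fn: int) -> float:
    """R = K_s / (K_p + K_fn), exactly as written in the TL;DR."""
    denominator = k_p + k_fn
    if denominator <= 0:
        raise ValueError("k_p + k_fn must be positive")
    return k_s / denominator


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """kappa = (P_o - P_e) / (1 - P_e) for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Made-up illustrative inputs, not data from the paper.
    print(severity_score(5, 6))
    print(recall_score(5, 6, 3))
    print(cohens_kappa(["mild", "severe", "mild", "none"],
                       ["mild", "severe", "none", "none"]))
```

The inputs in the `__main__` block are invented for illustration only; with real PRO-CTCAE forms, the keyword counts would come from whatever keyword-matching step the evaluation pipeline uses.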

Abstract

LLMs have transformed the execution of numerous tasks, including those in the medical domain. Among these, summarizing patient-reported outcomes (PROs) into concise natural language reports is of particular interest to clinicians, as it enables them to focus on critical patient concerns and spend more time in meaningful discussions. While existing work with LLMs like GPT-4 has shown impressive results, real breakthroughs could arise from leveraging SLMs as they offer the advantage of being deployable locally, ensuring patient data privacy and compliance with healthcare regulations. This study benchmarks several SLMs against LLMs for summarizing patient-reported Q&A forms in the context of radiotherapy. Using various metrics, we evaluate their precision and reliability. The findings highlight both the promise and limitations of SLMs for high-stakes medical tasks, fostering more efficient and privacy-preserving AI-driven healthcare solutions.

Paper Structure

This paper contains 19 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Summary of the task requested by the radiotherapist. The classical procedure is presented first, followed by the enhancement brought by LLMs.
  • Figure 2: LLM Evaluation Pipeline. Presentation of the different steps to evaluate the LLMs and SLMs.
  • Figure 3: Qualitative performance comparison between GPT-4 and Mistral 7B. Left: symptoms with the related severity. Right: summary for both GPT-4 and Mistral 7B.
  • Figure 4: Violin Plot of the Severity Score - Comparison of the LMs
  • Figure 5: Violin Plot of the Recall Score - Comparison of the LMs
  • ...and 2 more figures