Evaluation of Large Language Models for Summarization Tasks in the Medical Domain: A Narrative Review

Emma Croxford, Yanjun Gao, Nicholas Pellegrino, Karen K. Wong, Graham Wills, Elliot First, Frank J. Liao, Cherodeep Goswami, Brian Patterson, Majid Afshar

TL;DR

This narrative review assesses the current state of evaluation for clinical summarization tasks and proposes future directions to address the resource constraints of expert human evaluation.

Abstract

Large Language Models have advanced clinical Natural Language Generation, creating opportunities to manage the volume of medical text. However, the high-stakes nature of medicine requires reliable evaluation, which remains a challenge. In this narrative review, we assess the current state of evaluation for clinical summarization tasks and propose future directions to address the resource constraints of expert human evaluation.

Paper Structure (18 sections, 5 figures)

Figures (5)

  • Figure 1: Pre-LLM Automated Evaluation Metric Taxonomy. A structured organization of pre-LLM automated evaluation metrics, categorized by their bases and by whether they need ground-truth references. Metrics that were built for or have been applied in the clinical domain are in bold.
  • Figure 2: Stages of Prompt Engineering LLMs as Judges. The three aspects of prompt engineering expanded upon in Section 5. Three stages fit together into a larger schema for training and prompting an LLM to serve as an evaluator that complements human expert evaluators: Zero-Shot and In-Context Learning (ICL), Parameter-Efficient Fine-Tuning (PEFT), and PEFT with a Human-Aware Loss Function (HALO).
  • Figure 3: Anatomy of an Evaluator Prompt. An evaluator prompt consists of three sections: Prompt, Information, and Evaluation. All three components are essential for an LLM serving as an evaluator. The evaluator prompt needs to instruct the LLM on the task (Prompt), provide the LLM with all the information necessary to make an evaluation (Information), and define the guidelines and formatting of the evaluation (Evaluation).
  • Figure 4: Alignment Workflow: PPO vs. DPO. An overview of the processes for aligning an LLM through Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).
  • Figure 5: Human-Aware Loss Functions (HALOs) from PPO to Present. The development timeline for HALOs from the advent of Proximal Policy Optimization (PPO) in 2017 through 2024. HALOs are classified on an algorithmic basis and by their data requirements.