Evaluation of Large Language Models for Summarization Tasks in the Medical Domain: A Narrative Review

Emma Croxford; Yanjun Gao; Nicholas Pellegrino; Karen K. Wong; Graham Wills; Elliot First; Frank J. Liao; Cherodeep Goswami; Brian Patterson; Majid Afshar

TL;DR

This narrative review assesses the current evaluation state for clinical summarization tasks and proposes future directions to address the resource constraints of expert human evaluation.

Abstract

Large Language Models have advanced clinical Natural Language Generation, creating opportunities to manage the volume of medical text. However, the high-stakes nature of medicine requires reliable evaluation, which remains a challenge. In this narrative review, we assess the current evaluation state for clinical summarization tasks and propose future directions to address the resource constraints of expert human evaluation.

Paper Structure (18 sections, 5 figures)

Abstract
Introduction
Human Evaluations in Electronic Health Record Documentation
Criteria for Human Evaluations
Analysis of Human Evaluations
Drawbacks of Human Evaluations
Pre-LLM Automated Evaluations
Categories of Automated Evaluation
Drawbacks of Automated Metrics
FUTURE DIRECTIONS: LLMs as Evaluators to Complement Human Expert Evaluators: Prompt Engineering LLMs as Judges
Zero-Shot and In-Context Learning
Parameter Efficient Fine-Tuning
Parameter Efficient Fine-Tuning with Human-Aware Loss Function
Drawbacks of LLMs as Evaluators
Evaluation Needs for the Clinical Domain
...and 3 more sections

Figures (5)

Figure 1: Pre-LLM Automated Evaluation Metric Taxonomy. A structured organization of pre-LLM automated evaluation metrics, categorized by their bases and by their need for ground truth references. Metrics that were built for or have been applied in the clinical domain are in bold.
Figure 2: Stages of Prompt Engineering LLMs as Judges. The three aspects of prompt engineering expanded upon in Section 5. The three sections (Zero-Shot and In-Context Learning (ICL), Parameter Efficient Fine-Tuning (PEFT), and PEFT with Human-Aware Loss Function (HALO)) fit together into a larger schema for training and prompting an LLM to serve as an evaluator that complements human expert evaluators.
Figure 3: Anatomy of an Evaluator Prompt. An evaluator prompt consists of three sections: Prompt, Information, and Evaluation. All three components are essential for an LLM serving as an evaluator. The evaluator prompt needs to instruct the LLM on the task (Prompt), provide the LLM with all the information necessary to make an evaluation (Information), and define the guidelines and formatting of the evaluation (Evaluation).
Figure 4: Alignment Workflow: PPO v. DPO. An overview of the processes for aligning an LLM through Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).
Figure 5: Human-Aware Loss Functions (HALOs) from PPO to Present. The development timeline for HALOs from the advent of Proximal Policy Optimization (PPO) in 2017 through 2024. HALOs are classified on an algorithmic basis and on their data requirements.
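
The three-part evaluator prompt described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the class name `EvaluatorPrompt`, its field names, the `###` section markers, and the placeholder strings are all assumptions introduced here, and no real clinical text or model call is involved.

```python
from dataclasses import dataclass


@dataclass
class EvaluatorPrompt:
    """Hypothetical container for the three sections named in Figure 3."""
    task_instruction: str   # Prompt: what the LLM is asked to do
    source_text: str        # Information: the document that was summarized
    candidate_summary: str  # Information: the summary being judged
    guidelines: str         # Evaluation: criteria the judgment must follow
    output_format: str      # Evaluation: required shape of the answer

    def render(self) -> str:
        """Join the three sections, in order, into one prompt string."""
        sections = [
            ("Prompt", self.task_instruction),
            ("Information",
             f"Source:\n{self.source_text}\n\nSummary:\n{self.candidate_summary}"),
            ("Evaluation",
             f"Guidelines:\n{self.guidelines}\n\nOutput format:\n{self.output_format}"),
        ]
        # The "###" markers are an arbitrary choice for this sketch.
        return "\n\n".join(f"### {name}\n{body}" for name, body in sections)


# Placeholder values only; a real prompt would carry the actual note,
# summary, and rubric.
example = EvaluatorPrompt(
    task_instruction="You are evaluating a summary of a clinical note.",
    source_text="<source clinical note goes here>",
    candidate_summary="<candidate summary goes here>",
    guidelines="<evaluation criteria go here>",
    output_format="<required answer format goes here>",
)
prompt_text = example.render()
print(prompt_text)
```

Keeping the sections as separate fields makes it straightforward to vary one of them (for instance, the evaluation guidelines) while holding the other two fixed.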
