Impact of Large Language Model Assistance on Patients Reading Clinical Notes: A Mixed-Methods Study

Niklas Mannhardt; Elizabeth Bondi-Kelly; Barbara Lam; Hussein Mozannar; Chloe O'Connell; Mercy Asiedu; Alejandro Buendia; Tatiana Urman; Irbaz B. Riaz; Catherine E. Ricciardi; Monica Agrawal; Marzyeh Ghassemi; David Sontag

TL;DR

This mixed-methods study evaluates an end-to-end LLM-assisted tool that augments patient-facing clinical notes with Definitions, Simplification, FAQ, Key Information, and To-do List outputs to improve comprehension among breast cancer patients. Using real and synthetic notes, a survey (N=200), and interviews (N=7), the study shows that both the Select and All augmentation levels significantly improve action understanding and self-reported comprehension and confidence, with effects more pronounced for real notes. A detailed error taxonomy and automated readability metrics reveal that, while augmentations improve readability, certain definitions and real-note augmentations introduce potentially harmful or misleading errors, underscoring the need for clinician review and cautious deployment. The findings support careful, participatory design of patient-facing AI tools that empower patients while maintaining trust and safety in clinical communication.

Abstract

Large language models (LLMs) have immense potential to make information more accessible, particularly in medicine, where complex medical jargon can hinder patient comprehension of clinical notes. We developed a patient-facing tool using LLMs to make clinical notes more readable by simplifying, extracting information from, and adding context to the notes. We piloted the tool with clinical notes donated by patients with a history of breast cancer and synthetic notes from a clinician. Participants (N=200, healthy, female-identifying patients) were randomly assigned three clinical notes in our tool with varying levels of augmentations and answered quantitative and qualitative questions evaluating their understanding of follow-up actions. Augmentations significantly increased their quantitative understanding scores. In-depth interviews were conducted with participants (N=7, patients with a history of breast cancer), revealing both positive sentiments about the augmentations and concerns about AI. We also performed a qualitative clinician-driven analysis of the model's error modes.

Paper Structure (23 sections, 13 figures, 2 tables)

Introduction
Related Work
Clinical Notes Augmentations
Augmentation Types
User Interface
Experimental Setup
Clinical Notes Collection
Survey Study Design
Interview Study Design
Error Analysis of Augmentations
Statistical Analysis
Results
Study Population
Error Analysis
Quantitative Analysis
...and 8 more sections

Figures (13)

Figure 1: Our user interface to assist in the comprehension of clinical notes, containing (from left to right) an option to toggle between the original clinical note and simplified augmentation, an optional pane with the FAQ, key information, and to-do list augmentations, and finally, an optional definitions augmentation pane. Headings have been shown in the same formatting as visualizations in the rest of the paper.
Figure 2: Overview of the mixed-methods evaluation of LLM augmentations for clinical notes in our work. The evaluation used both synthetic and real notes (left of figure), and two sets of augmentations were studied: Select and All. Select included the simplification and definitions augmentations, and All included these as well as the FAQ, Key Information, and To-Do List augmentations. Augmentation evaluation was conducted using mixed methods (see list under Augmentation Analysis). Researchers on the team analyzed the results of the first several methods, and clinical research partners focused on clinical review (person with + icon). Readability was analyzed automatically. We used methods including both surveys (pen icon) and interviews (chat bubble icon), as well as clinical review (person with + icon) and automated analyses (computer monitor icon). Several of these evaluations focused on identifying errors (those highlighted in the last column).
Figure 3: Examples of the augmentations considered in our interface using a sample excerpt from a synthetic clinical note. This includes the augmentations of simplification, definitions, frequently asked questions, key information, and to-do list.
Figure 4: Error analysis of GPT-4 augmentations. (a), (b), and (c) show the number of errors in different categories, as determined by clinical reviewers. (a) shows the types of errors, such as hallucinations. The most frequent error occurred when GPT-4 assumed a certain context, e.g., assuming that a definition is specific to breast cancer when it is not. (b) groups the errors by augmented clinical note, where P represents a participant's donated note and S represents a synthetic note. There are very few errors for synthetic notes and a great deal of variation in the number of errors in donated clinical notes. (c) shows the number of errors in each of the augmentations, with definitions containing the most errors. (d) and (e) are the result of automated analysis of GPT-4 augmentations. The error bars on subfigures (d) and (e) represent confidence intervals computed using t-scores; note that we omit empty note augmentations from the analysis. (d) contains the percentage of words that are found in the Carnegie Mellon Pronouncing Dictionary Corpus Reader. Control has the lowest percentage, as it potentially contains more acronyms and jargon. (e) shows the Flesch-Kincaid grade level score, which in the augmentations is generally comparable to or lower than control, except for definitions. From interviews, we know that definitions can still contain confusing jargon.
Figure 5: Results from our web-based survey for participants' action understanding score for the Control, All, and Select conditions. We show the scores for all notes (a), synthetic and real notes (b), and scores across multiple subgroups (c). Generally, Select (definitions and simplification augmentations) leads to the best action understanding score and improvement over control. We note that both the All and Select conditions significantly increase the action understanding score in aggregate.
...and 8 more figures
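
The two automated readability measures named in the Figure 4 caption, the share of words found in a pronouncing dictionary and the Flesch-Kincaid grade level, can be illustrated with a minimal sketch. This is not the authors' implementation: the vowel-run syllable counter is a rough heuristic, the `vocabulary` set is a hypothetical stand-in for the Carnegie Mellon Pronouncing Dictionary, and the function names are chosen only for this example. The grade-level formula is the standard one, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59.

```python
import re
from typing import Set


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Standard Flesch-Kincaid grade level of a piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


def dictionary_coverage(text: str, vocabulary: Set[str]) -> float:
    """Fraction of words found in `vocabulary` (stand-in for a real dictionary)."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    if not words:
        raise ValueError("text must contain at least one word")
    return sum(w in vocabulary for w in words) / len(words)
```

Under these assumptions, a jargon-heavy sentence such as "Patient underwent bilateral mastectomy." scores a higher grade level than "She had surgery.", and abbreviations like "Pt" or "s/p" lower the dictionary coverage, which matches the direction of the trends the caption describes for the control notes.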
