MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues

Kuluhan Binici; Abhinav Ramesh Kashyap; Viktor Schlegel; Andy T. Liu; Vijay Prakash Dwivedi; Thanh-Tung Nguyen; Xiaoxue Gao; Nancy F. Chen; Stefan Winkler

TL;DR

This paper addresses the vulnerability of medical dialogue summarization to ASR transcription errors in low-resource domains. It introduces MEDSAGE, a pipeline that uses in-context learning with large language models to generate ASR-like synthetic dialogues, guided by ASR error profiling and an error-tagging system that align the generated noise with real-world error distributions. Augmenting training data with these synthetic noisy transcripts improves the robustness of summarization models to ASR errors, and qualitative and quantitative analyses indicate that the generated noise is realistic. The approach enables controllable, privacy-conscious augmentation for high-stakes medical NLP tasks and improves results on medical dialogue summarization benchmarks.

Abstract

Automatic Speech Recognition (ASR) systems are pivotal in transcribing speech into text, yet the errors they introduce can significantly degrade the performance of downstream tasks like summarization. This issue is particularly pronounced in clinical dialogue summarization, a low-resource domain where supervised data for fine-tuning is scarce, necessitating the use of ASR models as black-box solutions. Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). Specifically, we leverage the in-context learning capabilities of LLMs and instruct them to generate ASR-like errors based on a few available medical dialogue examples with audio recordings. Experimental results show that LLMs can effectively model ASR noise, and incorporating this noisy data into the training process significantly improves the robustness and accuracy of medical dialogue summarization systems. This approach addresses the challenges of noisy ASR outputs in critical applications, offering a robust solution to enhance the reliability of clinical dialogue summarization.
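The in-context generation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `<err>` tag format, the instruction wording, and the helper names `tag_words` and `build_prompt` are all hypothetical, and the call to an actual LLM is left out. The sketch only shows how a few (clean, ASR) example pairs and a tagged input dialogue might be assembled into one few-shot prompt, with the tag rate acting as the control knob for how much noise is requested.

```python
import random


def tag_words(text, error_rate, rng):
    """Wrap a random subset of words in hypothetical <err> tags.

    error_rate is the fraction of words marked for corruption.
    """
    words = text.split()
    n_tag = round(len(words) * error_rate)
    tagged_idx = set(rng.sample(range(len(words)), n_tag))
    return " ".join(
        f"<err>{w}</err>" if i in tagged_idx else w
        for i, w in enumerate(words)
    )


def build_prompt(examples, tagged_input):
    """Assemble a few-shot prompt from (clean, asr) example pairs.

    The instruction text is an assumption made for illustration.
    """
    parts = [
        "Rewrite each dialogue as a noisy ASR transcript. "
        "Corrupt only the words wrapped in <err> tags."
    ]
    for clean, noisy in examples:
        parts.append(f"Clean: {clean}\nASR: {noisy}")
    # The final block is left open for the LLM to complete.
    parts.append(f"Clean: {tagged_input}\nASR:")
    return "\n\n".join(parts)


rng = random.Random(0)  # fixed seed so the tagging is reproducible
examples = [("do you have chest pain", "do you have chess pain")]
tagged = tag_words("the patient reports mild fever since last night", 0.25, rng)
prompt = build_prompt(examples, tagged)
```

Raising `error_rate` marks more words for corruption, which is one plausible way to make the requested noise level controllable.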

Paper Structure (25 sections, 1 equation, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Dialogue Summarization
ASR Error Correction
Data Augmentation
MEDSAGE
Error Generation using In-context Learning
Controlled Generation
ASR Error Profiling:
Tagging System
Experiment Settings
ASR models:
Large Language Models:
Datasets:
Evaluation Metrics:
...and 10 more sections

Figures (5)

Figure 1: Overview of our MEDSAGE pipeline. First, in-context examples are constructed and the error profile of the target ASR model is inferred. Then the error profile, in-context examples, and input dialogues are processed by the LLM to generate noisy dialogues.
Figure 2: ASR errors of different models. The breakdown shows the different types of errors made by the model.
Figure 3: Comparison of an ASR transcription and its LLM-generated counterpart produced by MEDSAGE. Words highlighted in yellow are errors at word indexes common to both transcripts, while those highlighted in red and green indicate errors unique to ASR and MEDSAGE, respectively.
Figure 4: Similarities between ASR-generated transcriptions (rows) and LLM-generated synthetic transcriptions (columns) with respect to F1, ROUGE-L, and WER metrics. The highest similarities are observed on the diagonals, indicating overlap between corresponding ASR- and LLM-generated transcriptions.
Figure 5: Change in the downstream summarization quality w.r.t. F1 and ROUGE-L metrics as the rate of error tags used to generate synthetic dialogues increases. In-context examples were taken from Whisper-tiny transcribed dialogues. Green dots mark the scores of real ASR transcriptions.
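
The captions above refer to WER and to a breakdown of ASR error types. As a rough illustration of how such a breakdown can be computed, the sketch below aligns a reference and a hypothesis transcript with word-level edit distance and counts substitutions, deletions, and insertions. The function name `wer_breakdown` and the three-way error taxonomy are assumptions made for this example; the paper's own error profile may use different categories or tooling.

```python
def wer_breakdown(reference, hypothesis):
    """Word error rate plus counts of substitutions, deletions, insertions.

    Uses standard word-level Levenshtein alignment. When several optimal
    alignments exist, the counts describe one of them.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            c, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, d, n)          # exact match, no cost
            else:
                diag = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = dp[i - 1][j]
            delete = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insert = (c + 1, s, d, n + 1)
            # Tuples compare by total cost first, so min picks a cheapest path.
            dp[i][j] = min(diag, delete, insert)
    cost, subs, dels, ins = dp[-1][-1]
    # Guard against an empty reference to avoid division by zero.
    return {"wer": cost / max(len(ref), 1), "sub": subs, "del": dels, "ins": ins}


profile = wer_breakdown("the patient has chest pain", "the patient as pain")
```

Aggregating these counts over a small set of real ASR transcripts gives per-type error rates, which is one simple way to summarize how a given ASR model tends to fail.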
