
MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues

Kuluhan Binici, Abhinav Ramesh Kashyap, Viktor Schlegel, Andy T. Liu, Vijay Prakash Dwivedi, Thanh-Tung Nguyen, Xiaoxue Gao, Nancy F. Chen, Stefan Winkler

TL;DR

This paper addresses the vulnerability of medical dialogue summarization to ASR transcription errors in low-resource domains. It introduces MEDSAGE, a pipeline that uses in-context learning with large language models to generate synthetic dialogues containing ASR-like errors, guided by ASR error profiling and an error-tagging system so that the synthetic noise matches real-world noise distributions. Augmenting training data with these synthetic noisy transcripts improves the robustness of summarization models to ASR errors, and qualitative and quantitative analyses indicate that the generated noise is realistic. The approach enables controllable, privacy-conscious augmentation for high-stakes medical NLP tasks and shows improvements on medical dialogue summarization benchmarks.

Abstract

Automatic Speech Recognition (ASR) systems are pivotal in transcribing speech into text, yet the errors they introduce can significantly degrade the performance of downstream tasks like summarization. This issue is particularly pronounced in clinical dialogue summarization, a low-resource domain where supervised data for fine-tuning is scarce, necessitating the use of ASR models as black-box solutions. Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). Specifically, we leverage the in-context learning capabilities of LLMs and instruct them to generate ASR-like errors based on a few available medical dialogue examples with audio recordings. Experimental results show that LLMs can effectively model ASR noise, and incorporating this noisy data into the training process significantly improves the robustness and accuracy of medical dialogue summarization systems. This approach addresses the challenges of noisy ASR outputs in critical applications, offering a robust solution to enhance the reliability of clinical dialogue summarization.
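
The in-context setup described in the abstract can be illustrated with a minimal prompt-assembly sketch. This is not the authors' prompt: the `InContextExample` type, the `build_noise_prompt` function, the instruction wording, and the error-profile keys are all hypothetical names chosen for illustration, and the call to an actual LLM is left out.

```python
from dataclasses import dataclass


@dataclass
class InContextExample:
    clean: str  # reference transcript of an utterance
    noisy: str  # ASR output for the same utterance


def build_noise_prompt(error_profile, examples, dialogue):
    """Assemble a few-shot prompt asking an LLM to add ASR-like errors.

    error_profile maps a (hypothetical) error-type name to the fraction
    of words affected; examples are clean/noisy pairs from real audio.
    """
    lines = [
        "Rewrite the dialogue as if a speech recognizer had transcribed it.",
        "Target error profile (fraction of words affected):",
    ]
    for kind, rate in error_profile.items():
        lines.append(f"- {kind}: {rate:.1%}")
    for i, ex in enumerate(examples, start=1):
        lines.append(f"Example {i} clean: {ex.clean}")
        lines.append(f"Example {i} noisy: {ex.noisy}")
    lines.append(f"Input clean: {dialogue}")
    lines.append("Input noisy:")
    return "\n".join(lines)


prompt = build_noise_prompt(
    {"substitution": 0.05, "deletion": 0.02},
    [InContextExample("do you have a fever", "do you have fever")],
    "the patient reports chest pain",
)
```

The resulting string would be sent to whichever LLM is used. Keeping the error profile as an explicit argument is one way to make the amount of injected noise controllable.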

Paper Structure

This paper contains 25 sections, 1 equation, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of our MEDSAGE pipeline. First, in-context examples are constructed and the error profile of the target ASR model is inferred. Then, the error profile, in-context examples, and input dialogues are processed by the LLM to generate noisy dialogues.
  • Figure 2: ASR errors of different models. The breakdown shows the different types of errors made by each model.
  • Figure 3: Comparison of an ASR transcription and its LLM-generated counterpart produced by MEDSAGE. Words highlighted in yellow are errors at word indices shared by both transcripts, while words highlighted in red and green indicate errors unique to ASR and MEDSAGE.
  • Figure 4: Similarities between ASR-generated transcriptions (rows) and LLM-generated synthetic transcriptions (columns) with respect to F1, ROUGE-L, and WER metrics. The highest similarities are observed on the diagonal, indicating overlap between corresponding ASR- and LLM-generated transcriptions.
  • Figure 5: Change in downstream summarization quality w.r.t. F1 and ROUGE-L metrics as the rate of error tags used to generate synthetic dialogues increases. In-context examples were taken from Whisper-tiny transcribed dialogues. Green dots mark the scores of real ASR transcriptions.
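
The error-type breakdown in Figure 2 and the WER metric in Figure 4 both rest on a word-level edit-distance alignment. The sketch below shows the standard computation, assuming whitespace tokenization and no text normalization. The function name `wer_breakdown` is hypothetical, and the paper's own tooling may differ.

```python
def wer_breakdown(reference, hypothesis):
    """Word error rate plus counts of substitutions, deletions, insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (total_edits, subs, dels, ins) aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)          # match, no cost
            else:
                diag = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Keep the cheapest path; ties resolve in the order listed.
            dp[i][j] = min(diag, deletion, insertion, key=lambda c: c[0])
    total, subs, dels, ins = dp[-1][-1]
    return {
        "wer": total / max(len(ref), 1),
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }


result = wer_breakdown("the patient has a fever", "the patient as fever")
```

Here `has` is substituted by `as` and `a` is deleted, giving two edits over five reference words. When several minimal alignments exist, the split between error types depends on tie-breaking, although the total WER does not.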