Exploring Robustness in Doctor-Patient Conversation Summarization: An Analysis of Out-of-Domain SOAP Notes

Yu-Wen Chen, Julia Hirschberg

TL;DR

This study evaluates how state-of-the-art doctor-patient conversation summarization models generalize to out-of-domain data, comparing a general configuration with a SOAP-oriented approach across the MTS-Dialog and ACI-BENCH datasets. It analyzes LM-based fine-tuning and GPT-based methods, using LIWC and hallucination analyses to characterize information loss and error modes. Key findings show substantial cross-dataset performance drops: objective information is particularly at risk under general models, while SOAP-oriented models reduce omissions but introduce hallucinations in the AP section. GPT-based methods exhibit fewer hallucinations yet may still introduce non-evident data. The work informs design choices for robust clinical summarization in real-world settings and highlights directions for mitigating hallucination while preserving essential data.

Abstract

Summarizing medical conversations poses unique challenges due to the specialized domain and the difficulty of collecting in-domain training data. In this study, we investigate the performance of state-of-the-art generative summarization models for doctor-patient conversations on out-of-domain data. We divide these models into two configurations: (1) a general model, which does not specify subjective (S), objective (O), and assessment (A) and plan (P) notes; and (2) a SOAP-oriented model, which generates a summary with SOAP sections. We analyze the limitations and strengths of fine-tuned language model-based methods and GPTs in both configurations. We also conduct a Linguistic Inquiry and Word Count (LIWC) analysis to compare the SOAP notes from different datasets. The results show a strong correlation between reference notes across datasets, indicating that format mismatch (i.e., discrepancies in word distribution) is not the main cause of the performance decline on out-of-domain data. Lastly, we include a detailed analysis of SOAP notes to provide insights into the missing information and hallucinations introduced by the models.

Paper Structure (18 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Dataset examples. Samples in the MTS-Dialog dataset have a section header that indicates the category of the annotation and the section text, which is the main content of the notes. The samples in the ACI-BENCH dataset have one full note, where each section is separated by bold title text.
  • Figure 2: Illustration of the general and SOAP-oriented configurations.
  • Figure 3: Analysis of fine-tuning LM-based general model.
  • Figure 4: Analysis of fine-tuning LM-based SOAP-oriented model.
  • Figure 5: LIWC analysis of SOAP notes. Note that this result is calculated using all samples (i.e., the training, validation, and testing sets), rather than only the testing set used in the experiments on model performance. In addition, for simplicity and visualization purposes, we show only the LIWC categories that have a higher association with healthcare. The correlations between the two datasets are 0.93, 0.95, and 0.77 in S, O, and AP, respectively.
  • ...and 1 more figures
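
The Figure 5 caption reports per-section correlations between the LIWC profiles of the two datasets. As a rough illustration of how such a number could be obtained, the sketch below computes a Pearson correlation between two vectors of per-category scores. The source does not state which correlation coefficient was used, so Pearson is an assumption here; the category names and score values are hypothetical placeholders, not figures from the paper, and `pearson` is an illustrative helper rather than the authors' code.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-category scores for one SOAP section in each dataset.
# Names and values are made up for illustration only.
categories = ["category_a", "category_b", "category_c", "category_d"]
dataset_1 = {"category_a": 4.1, "category_b": 2.3, "category_c": 0.9, "category_d": 3.0}
dataset_2 = {"category_a": 3.8, "category_b": 2.9, "category_c": 1.1, "category_d": 2.6}

r = pearson(
    [dataset_1[c] for c in categories],
    [dataset_2[c] for c in categories],
)
print(f"correlation for this section: {r:.2f}")
```

Under this reading, a value near 1 means the two datasets distribute their words across the categories in a similar way for that section, which is the sense in which the abstract argues that format mismatch is not the main cause of the out-of-domain performance drop.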