Exploring Robustness in Doctor-Patient Conversation Summarization: An Analysis of Out-of-Domain SOAP Notes
Yu-Wen Chen, Julia Hirschberg
TL;DR
This study evaluates how state-of-the-art doctor-patient conversation summarization models generalize to out-of-domain data, comparing a general configuration with a SOAP-oriented configuration on the MTS-Dialog and ACI-BENCH datasets. It analyzes fine-tuned LM-based methods and GPT-based methods, using LIWC and hallucination analyses to characterize information loss and error modes. Key findings show substantial cross-dataset performance drops: general models are particularly prone to losing objective information, while SOAP-oriented models reduce omissions but introduce hallucinations in the AP section. GPT-based methods hallucinate less, yet may still introduce non-evident data. The work informs design choices for robust clinical summarization in real-world settings and highlights directions for mitigating hallucination while preserving essential data.
Abstract
Summarizing medical conversations poses unique challenges due to the specialized domain and the difficulty of collecting in-domain training data. In this study, we investigate the performance of state-of-the-art generative summarization models for doctor-patient conversations on out-of-domain data. We divide these models into two configurations: (1) a general model, which generates a summary without distinguishing subjective (S), objective (O), and assessment (A) and plan (P) notes; and (2) a SOAP-oriented model, which generates a summary organized into SOAP sections. We analyze the limitations and strengths of methods based on fine-tuning language models and of GPTs in both configurations. We also conduct a Linguistic Inquiry and Word Count (LIWC) analysis to compare the SOAP notes from different datasets. The results show a strong correlation between reference notes from different datasets, indicating that format mismatch (i.e., discrepancies in word distribution) is not the main cause of the performance decline on out-of-domain data. Lastly, we include a detailed analysis of SOAP notes to provide insights into the missing information and hallucinations introduced by the models.
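To make the word-distribution comparison concrete, the following is a minimal sketch of how category-frequency profiles of reference notes from two datasets could be correlated. It is an illustration under stated assumptions, not the procedure used in this study: the LIWC dictionary is proprietary, so `TOY_LEXICON` is a hypothetical stand-in, the helper names `category_profile` and `pearson` are invented for this example, and the choice of Pearson correlation is an assumption.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon standing in for LIWC categories.
# The real LIWC dictionary is proprietary and far larger.
TOY_LEXICON = {
    "body": {"chest", "heart", "lungs", "abdomen"},
    "negate": {"no", "not", "denies", "without"},
    "time": {"today", "days", "weeks", "ago"},
    "health": {"pain", "fever", "cough", "medication"},
}


def category_profile(notes):
    """Return, per toy category, the share of tokens that fall into it."""
    counts = Counter()
    total = 0
    for note in notes:
        for raw in note.lower().split():
            token = raw.strip(".,;:()")
            if not token:
                continue
            total += 1
            for category, words in TOY_LEXICON.items():
                if token in words:
                    counts[category] += 1
    return [counts[c] / total if total else 0.0 for c in TOY_LEXICON]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (assumed metric)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # correlation is undefined for a constant vector; 0.0 by convention here
    return cov / (norm_x * norm_y)


# Made-up example notes, not drawn from MTS-Dialog or ACI-BENCH.
notes_a = ["Patient denies fever. Chest pain started two days ago."]
notes_b = ["No cough today. Reports abdomen pain without fever."]
similarity = pearson(category_profile(notes_a), category_profile(notes_b))
```

A correlation near 1 between the two profiles would suggest that the notes use similar categories of words, which is the kind of evidence the abstract cites when arguing that format mismatch is not the main cause of the out-of-domain performance drop.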
