CLINICSUM: Utilizing Language Models for Generating Clinical Summaries from Patient-Doctor Conversations

Subash Neupane, Himanshu Tripathi, Shaswata Mitra, Sean Bozorgzad, Sudip Mittal, Shahram Rahimi, Amin Amirlatifi

TL;DR

ClinicSum addresses automatic SOAP-form clinical summary generation from doctor–patient conversations by coupling a retrieval-based filtering stage with a fine-tuned language-model generator. The system is trained on an SME-validated dataset of 1,473 conversation–summary pairs derived from FigShare and MTS-Dialog, and it uses an ensemble of sparse and dense retrieval with Reciprocal Rank Fusion to pass concise, relevant context to a fine-tuned PLM. Automatic metrics (ROUGE, BERTScore) and expert human evaluations show that ClinicSum with open-source models (notably LLAMA-3-8B) outperforms GPT-based approaches in both lexical and semantic fidelity, while reducing hallucinations through token filtering. The work demonstrates the practical potential of deploying efficient, domain-targeted summarization in clinical settings and identifies data expansion, scalability, and bias/hallucination mitigation as priority directions for future work.

Abstract

This paper presents ClinicSum, a novel framework designed to automatically generate clinical summaries from patient-doctor conversations. It utilizes a two-module architecture: a retrieval-based filtering module that extracts Subjective, Objective, Assessment, and Plan (SOAP) information from conversation transcripts, and an inference module powered by fine-tuned Pre-trained Language Models (PLMs), which leverage the extracted SOAP data to generate abstracted clinical summaries. To fine-tune the PLM, we created a training dataset consisting of 1,473 conversation-summary pairs by consolidating two publicly available datasets, FigShare and MTS-Dialog, with ground truth summaries validated by Subject Matter Experts (SMEs). ClinicSum's effectiveness is evaluated through both automatic metrics (e.g., ROUGE, BERTScore) and expert human assessments. Results show that ClinicSum outperforms state-of-the-art PLMs, demonstrating superior precision, recall, and F-1 scores in automatic evaluations and receiving high preference from SMEs in human assessment, making it a robust solution for automated clinical summarization.
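
The retrieval-based filtering module is described as combining sparse and dense retrievers through Reciprocal Rank Fusion (RRF). As a minimal sketch of how such a fusion step could work, the Python below merges two ranked lists of conversation utterances. The function name, the utterance identifiers, and the smoothing constant k = 60 (a commonly used default for RRF) are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item ids into one ranking.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed default, not a value
    reported for ClinicSum.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical utterance ids ranked by a sparse (e.g. keyword-based)
# retriever and by a dense (embedding-based) retriever.
sparse_ranking = ["u3", "u1", "u2"]
dense_ranking = ["u1", "u4", "u3"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Items ranked highly by both retrievers rise to the top of the fused list, so only the leading entries would need to be passed on as context to the fine-tuned PLM.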

Paper Structure

This paper contains 22 sections, 13 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: A graphical overview of ClinicSum. P denotes the Patient and D denotes the Doctor in the conversation transcript. S, O, A, and P refer to the Subjective, Objective, Assessment, and Plan components of the clinical summary.
  • Figure 2: A is a graphical illustration of the ClinicSum architecture. It comprises two modules: retrieval-based filtering and inference. B represents the patient-doctor conversation, and C represents the generated clinical summary. [...] (used for brevity) indicates that there is more textual information.
  • Figure 3: An example of an Alpaca prompt.
  • Figure 4: Heatmap illustrating the preferences between summaries generated by ClinicSum and GPT, along with ties indicating equal preference between the two.
  • Figure 5: Radar chart illustrating how different models compare in terms of two key metrics: the average number of tokens and the BERTScore F-1 score.