Building a Chinese Medical Dialogue System: Integrating Large-scale Corpora and Novel Models

Xinyuan Wang, Haozhou Li, Dingfang Zheng, Qinke Peng

TL;DR

This work tackles data scarcity and domain-knowledge gaps in Chinese medical dialogue by building the Large-scale Chinese Medical Dialogue Corpora (LCMDC) and delivering two systems: a BERT-based intelligent triage framework (combining supervised learning and prompt learning) and a GPT-2-based medical consultation model enhanced with a domain knowledge graph. It introduces three components of LCMDC (coarse-grained triage, fine-grained diagnosis, and medical consultation) and demonstrates improved performance over baselines through domain-adaptive pre-training and knowledge augmentation. Key contributions include scalable medical datasets, a novel two-track triage approach, and a knowledge-augmented dialogue generator with improved evaluation metrics across triage and QA tasks. The results have practical significance for scalable online medical triage and doctor–patient consultation, with potential for personalized and aligned healthcare AI services.

Abstract

The global COVID-19 pandemic underscored major deficiencies in traditional healthcare systems and hastened the advancement of online medical services, especially medical triage and consultation. However, existing studies face two main challenges. First, large-scale, publicly available, domain-specific medical datasets are scarce because of privacy concerns; current datasets are small and cover only a few diseases, which limits the effectiveness of triage methods based on Pre-trained Language Models (PLMs). Second, existing methods lack medical knowledge and struggle to accurately understand professional terms and expressions in patient-doctor consultations. To overcome these obstacles, we construct the Large-scale Chinese Medical Dialogue Corpora (LCMDC), thereby addressing the data shortage in this field. We further propose a novel triage system that combines BERT-based supervised learning with prompt learning, as well as a GPT-based medical consultation model. To enhance domain knowledge acquisition, we pre-trained the PLMs on our self-constructed background corpus. Experimental results on the LCMDC demonstrate the efficacy of our proposed systems.

Paper Structure

This paper contains 34 sections, 12 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Age and Gender Distribution in our Corpus.
  • Figure 2: The Proposed Supervised Learning Classification Method.
  • Figure 3: The Proposed Prompt Structure for Classification.
  • Figure 4: The Medical Dialogue System Framework.
  • Figure 5: Ablation Results of Supervised Classification Method.
  • ...and 2 more figures