LLMs for Doctors: Leveraging Medical LLMs to Assist Doctors, Not Replace Them

Wenya Xie, Qingying Xiao, Yu Zheng, Xidong Wang, Junying Chen, Ke Ji, Anningzhe Gao, Xiang Wan, Feng Jiang, Benyou Wang

TL;DR

The paper reframes AI in healthcare from autonomous patient consultations to doctor-centered assistance, arguing that clinicians must oversee AI outputs to ensure safety. It develops DoctorFLAN, a Chinese medical dataset of ~92K samples across 22 tasks and 27 specialties, plus DoctorFLAN-test and DotaBench for evaluation, and introduces DotaGPT trained on DoctorFLAN. Through automatic and human evaluations across a suite of baselines, it shows that doctor-oriented training substantially improves performance on clinically relevant tasks and brings performance close to GPT-4 in some settings. The work provides a practical framework and resources for integrating LLMs into clinical workflows while highlighting the need for careful deployment, task prioritization, and domain-specific benchmarks to bridge the gap between patient-facing models and doctor-assistant AI.

Abstract

The recent success of Large Language Models (LLMs) has had a significant impact on the healthcare field, providing patients with medical advice, diagnostic information, and more. However, lacking professional medical knowledge, patients are easily misled by erroneous information generated by LLMs, which may result in serious medical problems. To address this issue, we focus on tuning LLMs to be medical assistants that collaborate with more experienced doctors. We first conduct a two-stage inspiration-feedback survey to gain a broad understanding of doctors' real needs for medical assistants. Based on this, we construct a Chinese medical dataset called DoctorFLAN to support the entire workflow of doctors, which includes 92K Q&A samples from 22 tasks and 27 specialists. Moreover, we evaluate LLMs in doctor-oriented scenarios by constructing DoctorFLAN-test, containing 550 single-turn Q&A items, and DotaBench, containing 74 multi-turn conversations. The evaluation results indicate that being a medical assistant still poses challenges for existing open-source models, but DoctorFLAN can help them significantly. This demonstrates that the doctor-oriented dataset and benchmarks we construct can complement existing patient-oriented work and better promote medical LLM research.

Paper Structure

This paper contains 49 sections, 14 figures, 13 tables.

Figures (14)

  • Figure 1: Overview of task categories in the LLMs for Doctors dataset across four phases: Admission, Diagnosis, Treatment, and Discharge, illustrating typical and alternative patient care pathways in solid and dashed lines, respectively.
  • Figure 2: Comparative assessment of task efficiency scores for each task according to our survey. The Task Efficiency Score quantifies the potential of various tasks to enhance operational efficiency in medical practice, reflecting improvements in time management, resource use, and overall workflow efficacy.
  • Figure 3: Pearson correlation between human and automatic evaluations on DoctorFLAN-test, illustrating task-level consistency (a minimal sketch of this computation follows the figure list).
  • Figure 4: Visual comparison of task overlap between LLMs as Doctors and LLMs for Doctors datasets, illustrating unique and shared tasks in DoctorFLAN.
  • Figure 5: Interface for Data Verification
  • ...and 9 more figures
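
Figure 3 reports agreement between human and automatic evaluations as a task-level Pearson correlation. As a minimal sketch (not the authors' code), the snippet below shows how such a correlation could be computed from per-task score averages; the task names and score values are hypothetical placeholders, not results from the paper.

```python
# Sketch: task-level agreement between human and automatic evaluation scores,
# in the spirit of Figure 3. All numbers below are hypothetical placeholders.
from scipy.stats import pearsonr

# Hypothetical per-task mean scores, e.g. averaged over DoctorFLAN-test items.
human_scores = {"diagnosis": 4.2, "treatment_planning": 3.8, "discharge_summary": 4.5}
auto_scores  = {"diagnosis": 4.0, "treatment_planning": 3.6, "discharge_summary": 4.4}

tasks = sorted(human_scores)
r, p_value = pearsonr([human_scores[t] for t in tasks],
                      [auto_scores[t] for t in tasks])
print(f"Task-level Pearson r = {r:.3f} (p = {p_value:.3f})")
```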