Exploring the Inquiry-Diagnosis Relationship with Advanced Patient Simulators
Zhaocheng Liu, Quan Tu, Wen Ye, Yu Xiao, Zhishou Zhang, Hengfu Cui, Yalun Zhu, Qiang Ju, Shizheng Li, Jian Xie
TL;DR
The paper tackles the underexplored link between patient inquiry and diagnosis in online medical consultation by introducing a patient simulator guided by real doctor-patient dialogue strategies. It builds a data-driven pipeline using MedDialog and Chinese medical records, expands dialogue strategy tags with GPT-4o, and trains a LoRA-tuned Chinese-language simulator via in-context learning and supervised fine-tuning. The simulator achieves substantially lower hallucination rates and higher anthropomorphism than baselines, enabling more realistic evaluation and the generation of synthetic data. Experiments show that the inquiry-diagnosis relationship follows Liebig's law and categorize inquiries into four types, providing actionable insights for optimizing inquiry allocation within 3–5 rounds of interaction.
Abstract
Recently, large language models have shown great potential to transform online medical consultation. Despite this, most research targets improving diagnostic accuracy when ample information is available, often overlooking the inquiry phase. Some studies try to evaluate or refine doctor models using prompt-engineered patient agents. However, prompt engineering alone falls short of accurately simulating real patients, so new paradigms for patient simulation are needed. Furthermore, the relationship between inquiry and diagnosis remains unexplored. This paper extracts dialogue strategies from real doctor-patient conversations to guide the training of a patient simulator. Guided by dynamic dialogue strategies, our simulator shows higher anthropomorphism and lower hallucination rates. It enables more accurate evaluation of diagnostic models and the generation of realistic synthetic data. We conduct extensive experiments on the relationship between inquiry and diagnosis, showing that they adhere to Liebig's law: poor inquiry limits diagnosis effectiveness, regardless of diagnostic skill, and vice versa. The experiments also reveal substantial differences in inquiry performance among models. To investigate this phenomenon, we categorize the inquiry process into four distinct types. Analyzing the distribution of inquiries across these types helps explain the performance differences. The weights of our patient simulator are available at https://github.com/PatientSimulator/PatientSimulator.
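As a rough illustration of the Liebig's-law claim, the sketch below treats the overall consultation outcome as capped by the weaker of the two stages. The `min` form, the function name `liebig_outcome`, and the 0-to-1 quality scores are all assumptions made for illustration; they are not the paper's model or its measured results.

```python
def liebig_outcome(inquiry_quality: float, diagnosis_skill: float) -> float:
    """Idealized limiting-factor model (an assumption, not the paper's formula).

    Both inputs are hypothetical scores in [0, 1]. The weaker stage caps
    the overall outcome, so improving only the stronger stage changes nothing.
    """
    for name, value in (
        ("inquiry_quality", inquiry_quality),
        ("diagnosis_skill", diagnosis_skill),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return min(inquiry_quality, diagnosis_skill)


# Hypothetical scores: a strong diagnoser paired with a weak inquirer.
weak_inquiry = liebig_outcome(inquiry_quality=0.3, diagnosis_skill=0.9)
# Raising diagnostic skill further does not help while inquiry stays weak.
stronger_diagnosis = liebig_outcome(inquiry_quality=0.3, diagnosis_skill=1.0)
# The converse case ("and vice versa"): strong inquiry, weak diagnosis.
weak_diagnosis = liebig_outcome(inquiry_quality=0.9, diagnosis_skill=0.3)
```

Under this idealization, all three calls return the same value, which is the sense in which poor inquiry limits diagnosis regardless of diagnostic skill, and the reverse.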
