Automatic Interactive Evaluation for Large Language Models with State Aware Patient Simulator

Yusheng Liao; Yutong Meng; Yuhao Wang; Hongcheng Liu; Yanfeng Wang; Yu Wang

TL;DR

This work introduces Automatic Interactive Evaluation (AIE) to robustly assess medical LLMs through dynamic, multi-turn doctor–patient simulations. Central to AIE is the State Aware Patient Simulator (SAPS), a three-part system (state tracker, memory bank, response generator) that engages doctor LLMs across a 10-action space to diagnose from real-world hospital cases and public exam datasets. The study demonstrates SAPS realism, strong alignment with human judgments, and meaningful correlations with GPT-4 assessments, while revealing differences between open- and closed-source models in interactive diagnostic tasks. The framework offers a scalable, ethically mindful approach to validate clinical capabilities of LLMs beyond static knowledge tests, with broader implications for AI-assisted healthcare and other high-fidelity interactive domains.

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency in human interactions, yet their application within the medical field remains insufficiently explored. Previous work mainly measures medical knowledge through examinations, which are far from realistic scenarios and fall short in assessing the abilities of LLMs on clinical tasks. To enhance the application of LLMs in healthcare, this paper introduces the Automated Interactive Evaluation (AIE) framework and the State-Aware Patient Simulator (SAPS), targeting the gap between traditional LLM evaluations and the nuanced demands of clinical practice. Unlike prior methods that rely on static medical knowledge assessments, AIE and SAPS provide a dynamic, realistic platform for assessing LLMs through multi-turn doctor-patient simulations. This approach offers a closer approximation to real clinical scenarios and allows for a detailed analysis of LLM behaviors in response to complex patient interactions. Our extensive experimental validation demonstrates the effectiveness of the AIE framework, with outcomes that align well with human evaluations, underscoring its potential to revolutionize medical LLM testing for improved healthcare delivery.

Paper Structure (34 sections, 20 equations, 6 figures, 3 tables)

Introduction
Results
Overview
Datasets
Patient Simulator Test Sets
Doctor LLMs Test Sets
Results of Patient Simulators
Comparative Evaluation on HospitalCases
Automatic Metrics Evaluation on HospitalCases
Metrics Correlation Analysis
Correlation Analysis among Different Subsets
Automatic Evaluation on MedicalExam
Evaluation Format Analysis
Turn Analysis
Discussion
...and 19 more sections

Figures (6)

Figure 1: Overview of the Automatic Interactive Evaluation framework. a State Aware Patient Simulator (SAPS). SAPS structure includes a state tracker for classifying doctor behaviors, a memory bank for information retrieval, and a response generator for creating replies. Sentences with a dark background represent the parts that are activated within SAPS. b Conversation history between SAPS and Doctor LLM. The dialogue with a black border represents the latest round of dialogue. d Evaluated doctor LLM and its prompts. e Diagnosis. After the consultation dialogue, the doctor model must diagnose based on the information gathered during the conversation. 'Conversation' indicates the dialogue history.
Figure 2: Case examples of the consultation conversation between the doctor LLM Qianwen (noted as Doctor) and the patient simulator SAPS (noted as Patient). Each round is numbered and noted with the corresponding doctor LLM action category. The patient information is included above the conversation example.
Figure 3: Results on the patient simulator test set. We employ the six predefined patient metrics to evaluate the performance of different models and humans. a Change of metrics over dialogue turns. The bars and the lines in each plot describe the average scores and the relationship between the metrics and the number of dialogue turns, respectively. b Correlation between patient models and humans. Corr denotes the correlation coefficient. c Confusion matrix for state tracking between patient agent and humans.
Figure 4: Success rate without tie in comparative evaluation of doctor LLMs. The first and second rows show the results of the human evaluation from the perspective of the doctor and patient, respectively. The third and fourth rows show the results of the GPT-4 evaluation from the perspective of the doctor and patient.
Figure 5: Analysis of the correlation between automated metrics, GPT-4, and human assessment indicators. All indicators' correlations are tested. Considering the automated metrics are continuous, and GPT-4 and human assessments are ordinal, the Spearman correlation coefficient is used to calculate the correlation between different indicators. a assesses correlation across all test data. b the average human correlation coefficient between automated metrics and GPT-4 assessments. c-e explore correlations within specific subsets: c both models in the comparison are closed-source, d both models in the comparison are open-source, e one model is open-source and the other is closed-source.
...and 1 more figures
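
The Figure 5 caption states that the Spearman correlation coefficient is used because the automated metrics are continuous while the GPT-4 and human assessments are ordinal. A minimal sketch of that computation follows: it ranks both series (ties receive their average rank) and takes the Pearson correlation of the ranks. The function names and the sample values are illustrative assumptions, not data or code from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up example: a continuous automated metric against ordinal ratings.
metric_scores = [0.10, 0.40, 0.35, 0.80]
ordinal_ratings = [1, 2, 2, 3]
rho = spearman(metric_scores, ordinal_ratings)
```

Because only ranks enter the computation, the result is unchanged by any monotonic rescaling of the continuous metric, which is what makes it suitable for comparing against ordinal judgments.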
