Automatic Interactive Evaluation for Large Language Models with State Aware Patient Simulator
Yusheng Liao, Yutong Meng, Yuhao Wang, Hongcheng Liu, Yanfeng Wang, Yu Wang
TL;DR
This work introduces Automatic Interactive Evaluation (AIE) to assess medical LLMs through dynamic, multi-turn doctor–patient simulations. Central to AIE is the State Aware Patient Simulator (SAPS), a three-part system (state tracker, memory bank, response generator) that interacts with doctor LLMs, whose behavior is categorized over a 10-action space, as they work toward a diagnosis on cases drawn from real-world hospital records and public exam datasets. The study demonstrates the realism of SAPS, strong alignment with human judgments, and meaningful correlations with GPT-4 assessments, while revealing differences between open- and closed-source models in interactive diagnostic tasks. The framework offers a scalable, ethically mindful approach to validating the clinical capabilities of LLMs beyond static knowledge tests, with broader implications for AI-assisted healthcare and other high-fidelity interactive domains.
Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency in human interactions, yet their application within the medical field remains insufficiently explored. Previous work mainly measures medical knowledge through examinations, which are far from realistic clinical scenarios and fall short in assessing the abilities of LLMs on clinical tasks. To close this gap between traditional LLM evaluations and the nuanced demands of clinical practice, this paper introduces the Automatic Interactive Evaluation (AIE) framework and the State-Aware Patient Simulator (SAPS). Unlike prior methods that rely on static medical knowledge assessments, AIE and SAPS provide a dynamic, realistic platform for assessing LLMs through multi-turn doctor-patient simulations. This approach offers a closer approximation to real clinical scenarios and allows for a detailed analysis of LLM behaviors in response to complex patient interactions. Our extensive experimental validation demonstrates the effectiveness of the AIE framework, with outcomes that align well with human evaluations, underscoring its potential to revolutionize medical LLM testing for improved healthcare delivery.
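
To make the three-part design of SAPS (state tracker, memory bank, response generator) more concrete, the following is a minimal sketch and not the authors' implementation. The class name, the keyword-based state tracker, the topic keys, and the action labels are all hypothetical stand-ins; in particular, the labels shown here are not the paper's 10-action space, and a real simulator would presumably use an LLM rather than keyword matching for tracking and generation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPatientSimulator:
    """Illustrative three-part simulator (all names are assumptions)."""

    # Memory bank: facts from the case record, keyed by a hypothetical topic word.
    memory_bank: dict[str, str]
    # Dialogue history as (doctor utterance, patient reply) pairs.
    history: list[tuple[str, str]] = field(default_factory=list)

    def track_state(self, doctor_utterance: str) -> str:
        """State tracker: map the doctor's utterance to a coarse action label."""
        text = doctor_utterance.lower()
        if "diagnosis" in text:
            return "diagnosis"
        for topic in self.memory_bank:
            if topic in text:
                return f"inquiry:{topic}"
        return "other"

    def respond(self, doctor_utterance: str) -> str:
        """Response generator: choose a reply conditioned on the tracked state."""
        state = self.track_state(doctor_utterance)
        if state == "diagnosis":
            reply = "Thank you, doctor."
        elif state.startswith("inquiry:"):
            # Answer only from the memory bank, so the patient stays on record.
            reply = self.memory_bank[state.split(":", 1)[1]]
        else:
            reply = "I'm not sure. Could you ask something more specific?"
        self.history.append((doctor_utterance, reply))
        return reply


sim = ToyPatientSimulator({"fever": "I've had a fever for three days."})
print(sim.respond("Do you have a fever?"))
print(sim.respond("My diagnosis is influenza."))
```

The point of the sketch is the separation of concerns: the tracked state decides which behavior applies, and the memory bank limits what the simulated patient can disclose to what the case record contains.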
