OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries
Sandhanakrishnan Ravichandran, Shivesh Kumar, Rogerio Corga Da Silva, Miguel Romano, Reinhard Berkels, Michiel van der Heijden, Olivier Fail, Valentine Emmanuel Gnanapragasam
TL;DR
The paper investigates the readiness of an LLM-based clinical assistant (DR. INFO) for real-world medical conversations by evaluating it with HealthBench, a rubric-based benchmark designed to capture clinical behaviors beyond traditional exams. DR. INFO, built as an agentic retrieval-augmented generation (RAG) system, is benchmarked against frontier LLMs on the Hard subset (n=1000) and against other agentic assistants on a representative subset (n=100). It leads in both comparisons, with particular strengths in communication, instruction following, and accuracy, and with room for improvement in context awareness and completeness. The findings argue for the value of behavior-level evaluation in healthcare AI and demonstrate that rubric-based assessments can reveal gaps and guide safe, reliable deployment. The work also discusses limitations (subjectivity of rubrics, text-only modality) and suggests avenues for broader, multimodal benchmarking and future scaling to the full HealthBench dataset.
Abstract
Evaluating large language models (LLMs) on their ability to generate high-quality, accurate, situationally aware answers to clinical questions requires going beyond conventional benchmarks to assess how these systems behave in complex, high-stakes clinical scenarios. Traditional evaluations are often limited to multiple-choice questions that fail to capture essential competencies such as contextual reasoning, context awareness, and uncertainty handling. To address these limitations, we evaluate our agentic RAG-based clinical support assistant, DR. INFO, using HealthBench, a rubric-driven benchmark composed of open-ended, expert-annotated health conversations. On the Hard subset of 1,000 challenging examples, DR. INFO achieves a HealthBench Hard score of 0.68, outperforming leading frontier LLMs including the GPT-5 model family (GPT-5: 0.46, GPT-5.2: 0.42, GPT-5.1: 0.40), Grok 3 (0.23), Gemini 2.5 Pro (0.19), and Claude 3.7 Sonnet (0.02) across all behavioral axes (accuracy, completeness, instruction following, etc.). In a separate 100-sample evaluation against similar agentic RAG assistants (OpenEvidence and Pathway.md, now DoxGPT by Doximity), DR. INFO maintains a performance lead with a HealthBench Hard score of 0.72. These results highlight the strengths of DR. INFO in communication, instruction following, and accuracy, while also revealing areas for improvement in context awareness and response completeness. Overall, the findings underscore the utility of behavior-level, rubric-based evaluation for building reliable and trustworthy AI-enabled clinical support systems.
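To make the reported scores easier to interpret, the following is a minimal sketch of how a rubric-based score of this kind can be computed. It assumes the scheme described for HealthBench, in which each rubric criterion carries positive or negative points, a response earns the points of every criterion it meets, and the total is normalized by the maximum achievable (positive) points. The abstract does not spell out this formula, so treat the details as an assumption. The names `RubricItem`, `example_score`, and `benchmark_score` are hypothetical and do not come from the paper or from any HealthBench code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RubricItem:
    """One rubric criterion (illustrative structure, not from the paper)."""
    criterion: str
    points: int  # positive for desired behavior, negative for undesired behavior
    met: bool    # grader's judgment of whether the response meets the criterion


def example_score(items: list[RubricItem]) -> float:
    """Points earned on met criteria divided by the maximum achievable points."""
    max_points = sum(item.points for item in items if item.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(item.points for item in items if item.met)
    return earned / max_points  # can be negative if penalties dominate


def benchmark_score(per_example: list[float]) -> float:
    """Mean of per-example scores, clipped to the range [0, 1]."""
    return min(1.0, max(0.0, mean(per_example)))
```

Under this assumed scheme, a score near zero (such as the 0.02 reported for one baseline) would mean that penalties for undesired behavior nearly cancel the points earned, not that the model met almost no criteria.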
