OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries

Sandhanakrishnan Ravichandran, Shivesh Kumar, Rogerio Corga Da Silva, Miguel Romano, Reinhard Berkels, Michiel van der Heijden, Olivier Fail, Valentine Emmanuel Gnanapragasam

TL;DR

The paper investigates whether an LLM-based clinical assistant (DR. INFO) is ready for real-world medical conversations by evaluating it with HealthBench, a rubric-based benchmark designed to capture clinical behaviors that traditional exam-style tests miss. DR. INFO, built as an agentic retrieval-augmented generation (RAG) system, is benchmarked against frontier LLMs on the HealthBench Hard subset (n = 1,000) and against other agentic assistants on a separate 100-sample subset. It shows superior performance in communication, instruction following, and accuracy, while context awareness and completeness remain areas for improvement. The findings argue for the value of behavior-level evaluation in healthcare AI and demonstrate that rubric-based assessments can reveal gaps and guide safe, reliable deployment. The work also discusses limitations (subjectivity of rubrics, text-only modality) and suggests avenues for broader, multimodal benchmarking and future scaling to the full HealthBench dataset.

Abstract

Evaluating large language models (LLMs) on their ability to generate high-quality, accurate, situationally aware answers to clinical questions requires going beyond conventional benchmarks to assess how these systems behave in complex, high-stakes clinical scenarios. Traditional evaluations are often limited to multiple-choice questions that fail to capture essential competencies such as contextual reasoning, contextual awareness, and uncertainty handling. To address these limitations, we evaluate our agentic RAG-based clinical support assistant, DR. INFO, using HealthBench, a rubric-driven benchmark composed of open-ended, expert-annotated health conversations. On the Hard subset of 1,000 challenging examples, DR. INFO achieves a HealthBench Hard score of 0.68, outperforming leading frontier LLMs including the GPT-5 model family (GPT-5: 0.46, GPT-5.2: 0.42, GPT-5.1: 0.40), Grok 3 (0.23), Gemini 2.5 Pro (0.19), and Claude 3.7 Sonnet (0.02) across all behavioral axes (accuracy, completeness, instruction following, etc.). In a separate 100-sample evaluation against similar agentic RAG assistants (OpenEvidence and Pathway.md, now DoxGPT by Doximity), it maintains a performance lead with a HealthBench Hard score of 0.72. These results highlight the strengths of DR. INFO in communication, instruction following, and accuracy, while also revealing areas for improvement in context awareness and response completeness. Overall, the findings underscore the utility of behavior-level, rubric-based evaluation for building reliable and trustworthy AI-enabled clinical support systems.
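To make the idea of a rubric-driven score concrete, the following is a minimal sketch rather than the authors' evaluation code. It assumes the scoring scheme described for HealthBench: each rubric criterion carries a point value (positive for desirable behavior, negative for undesirable behavior), a grader marks each criterion as met or not met, an example's score is the points earned divided by the maximum positive points available, and the benchmark score is the mean over examples clipped to [0, 1]. The names `Criterion`, `example_score`, and `benchmark_score` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Criterion:
    """One rubric item (hypothetical structure for illustration)."""
    points: int  # positive = desirable behavior, negative = undesirable
    met: bool    # the grader's judgment for this response


def example_score(criteria):
    """Points earned on met criteria divided by the maximum positive points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return earned / max_points  # can be negative if penalties dominate


def benchmark_score(examples):
    """Mean of per-example scores, clipped to the range [0, 1]."""
    return min(1.0, max(0.0, mean(example_score(e) for e in examples)))


# Two toy conversations with made-up rubrics and grader decisions.
conversation_a = [Criterion(5, True), Criterion(3, False), Criterion(-2, True)]
conversation_b = [Criterion(10, True)]
overall = benchmark_score([conversation_a, conversation_b])
```

Under this assumed scheme, a response that triggers a negatively weighted criterion loses points even when its positive criteria are met, which is why a single aggregate score can reflect safety-relevant behavior as well as correctness.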

Paper Structure

This paper contains 20 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Comparison of axis-wise scores for DR. INFO and other frontier LLMs on the HealthBench Hard subset. Scores for the other models are visually approximated from the HealthBench paper (cited as arora2024healthbench).
  • Figure 2: HealthBench axis-wise scores for 100-sample subset: DR. INFO vs. OpenEvidence vs. Pathway.md (DoxGPT)