A Scalable Approach to Benchmarking the In-Conversation Differential Diagnostic Accuracy of a Health AI

Deep Bhatt; Surya Ayyagari; Anuruddh Mishra

TL;DR

This paper tackles the persistent problem of diagnostic errors in healthcare by introducing a scalable benchmarking framework that combines validated clinical vignettes with AI-powered patient actors to evaluate health AI systems in naturalistic conversations. The authors apply the framework to August, an AI-driven conversational chatbot, and report a top-1 diagnostic accuracy of 81.8% and a top-2 accuracy of 85.0%, along with 95.8% specialist-referral accuracy and substantially fewer questions per consultation than traditional symptom checkers. Key contributions include a reproducible, scalable evaluation approach and evidence that an AI chatbot can outperform several symptom checkers, and even some clinicians, in structured vignette-based testing while maintaining empathetic dialogue. The results highlight the potential to improve access to high-quality health information and to guide appropriate care, though the authors acknowledge the need for real-world validation, integration of objective clinical data, and broader demographic representation before clinical deployment.

Abstract

Diagnostic errors in healthcare persist as a critical challenge, with increasing numbers of patients turning to online resources for health information. While AI-powered healthcare chatbots show promise, there exists no standardized and scalable framework for evaluating their diagnostic capabilities. This study introduces a scalable benchmarking methodology for assessing health AI systems and demonstrates its application through August, an AI-driven conversational chatbot. Our methodology employs 400 validated clinical vignettes across 14 medical specialties, using AI-powered patient actors to simulate realistic clinical interactions. In systematic testing, August achieved a top-one diagnostic accuracy of 81.8% (327/400 cases) and a top-two accuracy of 85.0% (340/400 cases), significantly outperforming traditional symptom checkers. The system demonstrated 95.8% accuracy in specialist referrals and required 47% fewer questions compared to conventional symptom checkers (mean 16 vs 29 questions), while maintaining empathetic dialogue throughout consultations. These findings demonstrate the potential of AI chatbots to enhance healthcare delivery, though implementation challenges remain regarding real-world validation and integration of objective clinical data. This research provides a reproducible framework for evaluating healthcare AI systems, contributing to the responsible development and deployment of AI in clinical settings.
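
As a minimal sketch of how the top-one and top-two accuracy figures above could be computed from per-vignette results, the following Python snippet assumes that each case yields a ranked list of candidate diagnoses and a single reference diagnosis, and that a match is an exact string comparison. The function name `top_k_accuracy` and the toy cases are hypothetical illustrations, not the authors' evaluation code; a real evaluation would likely need synonym normalization or clinical review to decide whether a prediction matches.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    reference: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose reference diagnosis is among the first k predictions."""
    if len(ranked_predictions) != len(reference):
        raise ValueError("predictions and reference must have the same length")
    if not reference:
        raise ValueError("at least one case is required")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Count a hit when the reference label appears in the top-k slice (exact match).
    hits = sum(
        1 for preds, ref in zip(ranked_predictions, reference) if ref in preds[:k]
    )
    return hits / len(reference)


# Toy data (hypothetical): four simulated consultations with ranked differentials.
ranked = [
    ["migraine", "tension headache"],
    ["gastritis", "peptic ulcer"],
    ["asthma", "bronchitis"],
    ["cystitis", "pyelonephritis"],
]
gold = ["migraine", "peptic ulcer", "pneumonia", "cystitis"]

top1 = top_k_accuracy(ranked, gold, k=1)  # hits: cases 1 and 4
top2 = top_k_accuracy(ranked, gold, k=2)  # hits: cases 1, 2 and 4

# The counts reported in the abstract map onto the same ratio:
reported_top1 = 327 / 400
reported_top2 = 340 / 400
```

Because top-k accuracy only grows as k increases, the top-two figure can never fall below the top-one figure, which is consistent with the 327 and 340 correct cases out of 400 reported above.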
