Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D.
Arsh Gupta, Ajay Narayanan Sridhar, Bonam Mingole, Amulya Yadav
TL;DR
This work tackles rare-disease diagnosis from narrative medical cases by introducing a House M.D.-based dataset of 176 symptom–diagnosis pairs and evaluating four advanced LLMs on a structured, prompt-based diagnostic task. Accuracy varies substantially across models (16.48%–38.64%); newer model generations achieve notable gains, yet all models still struggle with rare diseases. The educationally validated benchmark, with its publicly available dataset and evaluation pipeline, provides a reproducible framework for evaluating narrative medical reasoning and motivates domain-specific fine-tuning, knowledge-base integration, and hybrid reasoning as avenues toward clinically useful AI support.
Abstract
Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show substantial variation in performance, with accuracy ranging from 16.48% to 38.64% and newer model generations demonstrating a 2.3-fold improvement. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.
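To make the reported figures concrete, the short Python sketch below works through the arithmetic. It assumes that accuracy is the fraction of the 176 cases answered correctly; the per-model correct counts are not stated in the abstract and are inferred here from the rounded percentages. The function names are illustrative and are not part of the paper's evaluation pipeline.

```python
TOTAL_CASES = 176  # dataset size stated in the abstract


def implied_correct(reported_pct: float, total: int = TOTAL_CASES) -> int:
    """Whole number of correct cases closest to a reported accuracy (inferred)."""
    return round(reported_pct / 100 * total)


def to_percent(correct: int, total: int = TOTAL_CASES) -> float:
    """Accuracy as a percentage, rounded to two decimals as in the abstract."""
    return round(100 * correct / total, 2)


# Endpoints of the accuracy range reported in the abstract.
low = implied_correct(16.48)
high = implied_correct(38.64)

# Ratio between the strongest and weakest reported results.
ratio = high / low

print(low, high, to_percent(low), to_percent(high), round(ratio, 2))
```

Under this assumption the endpoints correspond to 29 and 68 correct diagnoses out of 176, and their ratio of about 2.34 is consistent with the 2.3-fold improvement quoted above.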
