Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination
Amir Hosseinian, MohammadReza Zare Shahneh, Umer Mansoor, Gilbert Szeto, Kirill Karlin, Nima Aghaeepour
TL;DR
This study introduces Mirror, an evidence-grounded clinical reasoning system that combines a curated endocrinology and cardiometabolic corpus with a structured reasoning stack, and evaluates it against frontier LLMs on the ESAP 2025 endocrinology board-style exam. Mirror achieved 87.5% accuracy, outperforming GPT-5.2, GPT-5, and Gemini-3-Pro, even though it worked only from its curated corpus while the comparator models had web access. Its outputs were traceable to their evidence: 74.2% cited at least one guideline-tier source, and manual verification found 100% citation accuracy. The results indicate that domain-specific evidence curation with explicit provenance can surpass unconstrained web retrieval for subspecialty clinical reasoning, and that it supports the auditability needed for clinical deployment. The work lays groundwork for broader validation across subspecialties and in real-world clinical contexts.
Abstract
Background: Large language models have demonstrated strong performance on general medical examinations, but subspecialty clinical reasoning remains challenging because guidelines evolve rapidly and evidence hierarchies are nuanced. Methods: We evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. Mirror operated under a closed-evidence constraint with no external retrieval, whereas the comparator LLMs had real-time web access to guidelines and primary literature. Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding the human reference of 62.3% and the frontier LLMs GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy below 50%), Mirror achieved 76.7% accuracy. Top-2 accuracy was 92.5% for Mirror versus 85.25% for GPT-5.2. Conclusions: Mirror's outputs were traceable to their evidence: 74.2% cited at least one guideline-tier source, with 100% citation accuracy on manual verification. Curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment.
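As a rough check on the headline figure, the following sketch recomputes the accuracy and a 95% confidence interval from the reported counts (105 correct out of 120). The abstract does not say which interval method was used; the Wilson score interval is an assumption here, chosen because it happens to reproduce the reported bounds. The function name `wilson_interval` and the default `z = 1.96` are illustrative choices and do not come from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumption: the paper does not name its CI method; Wilson is used
    here only because it matches the reported 80.4-92.3% bounds.
    """
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts taken from the abstract: 105 correct answers out of 120 questions.
low, high = wilson_interval(105, 120)
print(f"accuracy = {105 / 120:.1%}, 95% CI = {low:.1%} to {high:.1%}")
# prints: accuracy = 87.5%, 95% CI = 80.4% to 92.3%
```

Only Mirror's interval can be checked this way, because the abstract gives raw counts for Mirror alone.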
