Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination

Amir Hosseinian; MohammadReza Zare Shahneh; Umer Mansoor; Gilbert Szeto; Kirill Karlin; Nima Aghaeepour

TL;DR

This study introduces Mirror, an evidence-grounded clinical reasoning system that combines a curated endocrinology and cardiometabolic corpus with a structured reasoning stack, and evaluates it against frontier LLMs on the 120-question ESAP 2025 endocrinology board-style exam. Mirror achieved 87.5% accuracy, outperforming GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%), even though it ran without external retrieval while the comparators had real-time web access. Its outputs were traceable to evidence: 74.2% cited at least one guideline-tier source, and manual verification found 100% citation accuracy. The results indicate that domain-specific evidence curation with explicit provenance can surpass unconstrained web retrieval for subspecialty clinical reasoning and supports the auditability needed for clinical deployment. The work lays groundwork for broader validation across subspecialties and real-world clinical contexts, as a step toward credible AI-assisted decision support in medicine.

Abstract

Background: Large language models have demonstrated strong performance on general medical examinations, but subspecialty clinical reasoning remains challenging due to rapidly evolving guidelines and nuanced evidence hierarchies.

Methods: We evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. Mirror operated under a closed-evidence constraint without external retrieval. Comparator LLMs had real-time web access to guidelines and primary literature.

Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy less than 50%), Mirror achieved 76.7% accuracy. Top-2 accuracy was 92.5% for Mirror versus 85.25% for GPT-5.2.

Conclusions: Mirror provided evidence traceability: 74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification. Curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment.
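The abstract reports a 95% confidence interval for Mirror's accuracy but does not say how it was computed. As a minimal sketch, assuming a Wilson score interval for a binomial proportion (an assumption, not a method confirmed by the source), the reported bounds can be reproduced from the 105/120 count. The function name `wilson_interval` is illustrative.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion.

    Assumption: the paper's interval method is not stated in the abstract;
    Wilson is used here only because it is a common choice for accuracy CIs.
    """
    p_hat = successes / n
    denom = 1 + z**2 / n
    # The interval's centre is pulled from p_hat toward 0.5.
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Counts taken from the abstract: 105 correct answers out of 120 questions.
low, high = wilson_interval(105, 120)
print(f"{105 / 120:.1%} (95% CI: {low:.1%}-{high:.1%})")
# prints: 87.5% (95% CI: 80.4%-92.3%)
```

Under this assumption the output matches the reported interval of 80.4-92.3%. Other interval constructions (for example, an exact Clopper-Pearson interval) would give slightly different bounds.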

Paper Structure (21 sections, 6 figures, 4 tables)

Introduction
Methods
Benchmark Dataset
Systems Evaluated
Web-Assisted Baseline Protocol
Mirror System Architecture
Evaluation Protocol
Citation Verification Protocol
Results
Overall Performance
Performance Relative to ESAP Respondent Reference
Top-2 Accuracy
Performance by Question Type
Error Analysis
Evidence Traceability and Citation Accuracy
...and 6 more sections

Figures (6)

Figure 1: Distribution of clinical reasoning domains in ESAP 2025. The left panel shows the total number of questions involving each domain. The main panel displays intersection sizes for domain combinations, sorted by frequency. The matrix indicates which domains comprise each intersection. Treatment-related reasoning was most prevalent (n=73), followed by pathophysiology (n=60) and diagnosis (n=57). The most common domain combination was risk/prognosis with treatment (n=16).
Figure 2: Comparative performance on ESAP 2025 endocrinology examination. Mirror achieved 87.5% accuracy compared to 74.6% for the best-performing frontier LLM (GPT-5.2) and 62.3% for the human reference (ESAP respondent mean). Optional dashed line indicates an internal upper-bound reference condition, provided for context. Error bars represent 95% confidence intervals. *$p < 0.05$ vs. Mirror.
Figure 3: Performance stratified by question difficulty (based on human accuracy). Mirror maintained strong performance across all difficulty tiers, with particular advantage on hard questions (human accuracy <50%) where it achieved 76.7% compared to 53.3% for GPT-5.2 and 39.5% human mean accuracy.
Figure 4: Paired win/loss analysis comparing Mirror to each baseline. For each comparison, "wins" indicate questions Mirror answered correctly while the baseline erred; "losses" indicate the reverse. Mirror achieved a net positive margin against all baselines, with the largest advantage over Gemini-3-Pro (19 net wins).
Figure 5: Performance by question type across systems. Mirror demonstrated consistent performance across all question categories, with the largest advantage in treatment questions (15.5 percentage point difference vs. GPT-5.2).
...and 1 more figure
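The paired win/loss tally described in the Figure 4 caption can be sketched as follows. This is an illustrative sketch only: the function name `paired_win_loss` and the per-question correctness vectors are hypothetical toy values, not data from the paper.

```python
def paired_win_loss(mirror_correct: list[bool], baseline_correct: list[bool]) -> tuple[int, int, int]:
    """Count paired wins and losses over the same set of questions.

    A "win" is a question Mirror answered correctly while the baseline erred;
    a "loss" is the reverse. Questions where both agree contribute to neither.
    """
    if len(mirror_correct) != len(baseline_correct):
        raise ValueError("Both systems must be scored on the same questions")
    pairs = list(zip(mirror_correct, baseline_correct))
    wins = sum(1 for m, b in pairs if m and not b)
    losses = sum(1 for m, b in pairs if b and not m)
    return wins, losses, wins - losses

# Hypothetical toy data for five questions (NOT results from the paper).
mirror = [True, True, False, True, False]
baseline = [True, False, False, False, True]

wins, losses, net = paired_win_loss(mirror, baseline)
print(wins, losses, net)  # prints: 2 1 1
```

Only discordant questions enter the tally, so the net margin reflects the questions on which the two systems actually disagree.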
