H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis

Seungseop Lim, Gibaeg Kim, Hyunkyung Lee, Wooseok Han, Jean Seo, Jaehyo Yoo, Eunho Yang

TL;DR

The paper addresses the problem that flat metrics for evaluating LLMs on differential diagnosis do not capture clinical utility. It introduces H-DDx, a hierarchical evaluation framework that maps free-text diagnoses to the ICD-10 taxonomy and uses a Hierarchical DDx F1 (HDF1) metric to reward clinically related outputs. The approach combines embedding-based retrieval with LLM reranking to map predictions to ICD-10 codes, then expands both the predictions and the ground truth with their ICD-10 ancestors before scoring. On a large synthetic benchmark with 22 LLMs, HDF1 significantly reorders model rankings relative to flat Top-5 accuracy, shows that domain-specialized models achieve strong clinical coherence, and enables interpretable analysis of error patterns. The framework aligns the evaluation of diagnostic AI with clinical reasoning and potential utility in patient care; validation on real-world data and richer ontologies is left for future work.

Abstract

An accurate differential diagnosis (DDx) is essential for patient care, shaping therapeutic decisions and influencing outcomes. Recently, Large Language Models (LLMs) have emerged as promising tools to support this process by generating a DDx list from patient narratives. However, existing evaluations of LLMs in this domain primarily rely on flat metrics, such as Top-k accuracy, which fail to distinguish between clinically relevant near-misses and diagnostically distant errors. To mitigate this limitation, we introduce H-DDx, a hierarchical evaluation framework that better reflects clinical relevance. H-DDx leverages a retrieval and reranking pipeline to map free-text diagnoses to ICD-10 codes and applies a hierarchical metric that credits predictions closely related to the ground-truth diagnosis. In benchmarking 22 leading models, we show that conventional flat metrics underestimate performance by overlooking clinically meaningful outputs, with our results highlighting the strengths of domain-specialized open-source models. Furthermore, our framework enhances interpretability by revealing hierarchical error patterns, demonstrating that LLMs often correctly identify the broader clinical context even when the precise diagnosis is missed.
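
To make the idea of crediting near-misses concrete, the following is a minimal sketch of a hierarchical F1 computed over ancestor-expanded code sets. It is an illustration under stated assumptions, not the paper's implementation: the `PARENT` map is a hand-written toy fragment of ICD-10 with only three levels (category, block, chapter), the function names `expand` and `hierarchical_f1` are hypothetical, and the set-overlap form of precision and recall is assumed rather than taken from the paper's equation.

```python
# Toy fragment of the ICD-10 hierarchy (category -> block -> chapter).
# Assumption: hand-written for illustration; the paper uses the full taxonomy.
PARENT = {
    "J10": "J09-J18",      # Influenza due to identified virus
    "J18": "J09-J18",      # Pneumonia, organism unspecified
    "J00": "J00-J06",      # Acute nasopharyngitis (common cold)
    "G43": "G40-G47",      # Migraine
    "J09-J18": "X",        # Block: influenza and pneumonia
    "J00-J06": "X",        # Block: acute upper respiratory infections
    "G40-G47": "VI",       # Block: episodic and paroxysmal disorders
}


def expand(codes):
    """Return the codes together with all of their ancestors."""
    expanded = set()
    for code in codes:
        while code is not None:
            expanded.add(code)
            code = PARENT.get(code)  # None once the chapter level is passed
    return expanded


def hierarchical_f1(predicted, gold):
    """F1 over ancestor-expanded sets (assumed set-overlap formulation)."""
    pred_set, gold_set = expand(predicted), expand(gold)
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


gold = {"J10"}                 # ground truth: influenza
prediction_a = {"J18", "J00"}  # related respiratory infections
prediction_b = {"G43"}         # unrelated diagnosis

score_a = hierarchical_f1(prediction_a, gold)
score_b = hierarchical_f1(prediction_b, gold)
print(f"A: {score_a:.2f}  B: {score_b:.2f}")
```

Neither prediction contains the exact code J10, so a flat exact-match metric scores both as misses. In this sketch Prediction A still earns partial credit because it shares the block and chapter ancestors with the ground truth, while Prediction B shares none.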

Paper Structure

This paper contains 45 sections, 1 equation, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Comparison of the H-DDx framework and conventional flat metrics. For a patient with Influenza, flat metrics fail to distinguish between the clinically relevant DDx set from Prediction A (related respiratory infections) and the irrelevant list from Prediction B (e.g., Migraine), scoring both poorly. H-DDx uses the ICD-10 taxonomy for a more nuanced evaluation. By expanding the differential diagnosis sets into the taxonomy, it identifies Prediction A's outputs as clinically relevant near-misses and Prediction B's as distant errors. The HDF1 score quantifies this distinction, capturing the superior clinical utility of Prediction A that flat metrics overlook.
  • Figure 2: Overview of the H-DDx framework.
  • Figure 3: Hierarchical cascade pattern in diagnostic performance across ICD-10 levels. The overlapping bars show HDF1 scores calculated for each hierarchy level, with models sorted by their performance at the Subcategory level.
  • Figure 4: Shift in model rankings from flat (Top-5 Accuracy) to hierarchical (HDF1) evaluation. The figure illustrates the rank changes when moving from a conventional accuracy metric (left) to the proposed hierarchical score (right). Models are color-coded as proprietary (yellow), open-source (blue), and medically fine-tuned (green). Note the significant rank improvement of medically fine-tuned models like MediPhi, which highlights HDF1's ability to better capture clinical relevance.
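
The rank comparison described for Figure 4 can be sketched as follows. This is a hedged illustration only: the model names and scores are placeholders invented for the example and are not results from the paper, and the helper names `ranks` and `rank_shift` are hypothetical.

```python
def ranks(scores):
    """Map each model to its rank (1 = best) under the given scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_shift(flat_scores, hierarchical_scores):
    """Positive values mean the model ranks higher under the hierarchical metric."""
    flat_rank = ranks(flat_scores)
    hier_rank = ranks(hierarchical_scores)
    return {model: flat_rank[model] - hier_rank[model] for model in flat_scores}


# Placeholder numbers, NOT taken from the paper.
top5_accuracy = {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70}
hdf1 = {"model_a": 0.40, "model_b": 0.35, "model_c": 0.45}

shifts = rank_shift(top5_accuracy, hdf1)
for model, delta in shifts.items():
    print(f"{model}: {delta:+d}")
```

With these placeholder scores, `model_c` moves from third place under Top-5 accuracy to first under the hierarchical score, which is the kind of movement the figure visualizes.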