Table of Contents
Fetching ...

From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Jessica M. Lundin, Usman Nasir Nakakana, Guillaume Chabot-Couture

Abstract

Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.

From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Abstract

Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.

Paper Structure

This paper contains 21 sections, 6 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Model accuracy across different question types with 95% confidence intervals across five clinical question categories: symptom-condition (symp_cond), condition-symptom (cond_symp), condition-severity (cond_severity), condition-treatment (cond_treat), and condition-followup (cond_followup). Error bars represent 95% confidence intervals.
  • Figure 2: Accuracy delta heatmap showing the difference between question-type-specific accuracy and overall model accuracy for each model. Positive values (red/orange) indicate above-average performance for that question type, while negative values (blue) indicate below-average performance. Values are expressed as percentage points.
  • Figure 3: Accuracy by template and model. Each cell shows the accuracy (%) for a given model and question template. Darker blue indicates higher accuracy. Substantial within-type variance across templates demonstrates that question phrasing affects model performance independently of the underlying clinical relationship.