When Documents Disagree: Measuring Institutional Variation in Transplant Guidance with Retrieval-Augmented Language Models

Yubo Li, Ramayya Krishnan, Rema Padman

Abstract

Patient education materials for solid-organ transplantation vary substantially across U.S. centers, yet no systematic method exists to quantify this heterogeneity at scale. We introduce a framework that grounds the same patient questions in different centers' handbooks using retrieval-augmented language models and compares the resulting answers with a five-label consistency taxonomy. Applied to 102 handbooks from 23 centers and 1,115 benchmark questions, the framework quantifies heterogeneity across four dimensions: question, topic, organ, and center. We find that 20.8% of non-absent pairwise comparisons exhibit clinically meaningful divergence, concentrated in condition monitoring and lifestyle topics. Coverage gaps are even more prominent: 96.2% of question-handbook pairs miss relevant content, with reproductive health at 95.1% absence. Center-level divergence profiles are stable and interpretable: heterogeneity reflects systematic institutional differences, likely due to patient diversity. These findings expose an information gap in transplant patient education materials and show that document-grounded medical question answering can highlight opportunities for content improvement.

Paper Structure (22 sections, 6 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Center distribution map for the handbook corpus. The 23 contributing U.S. transplant centers are geographically dispersed across the country.
  • Figure 2: Distribution of benchmark questions by organ type and topic category.
  • Figure 3: Overview of the experimental pipeline.
  • Figure 4: Selected comparison matrices illustrating heterogeneity patterns. Each cell represents a pairwise comparison between two center handbooks for a single question. Colors: grey = Absent, green = Consistent, yellow = Complementary, blue = Divergent, black = Contradictory. Handbook labels on axes are anonymized (organ-center_index-phase). Matrices are symmetric; diagonal entries are Consistent by convention.
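
The comparison matrices described for Figure 4 can be illustrated with a minimal Python sketch. It builds a symmetric label matrix for one question, sets the diagonal to Consistent by convention, and computes the share of non-absent pairs that diverge. Several things here are assumptions for illustration only: the handbook identifiers and toy labels are made up, the function names are hypothetical, and counting both Divergent and Contradictory as "clinically meaningful divergence" is a guess at the definition, not something this page states.

```python
from collections import Counter
from itertools import combinations

# The five taxonomy labels named in the Figure 4 caption.
LABELS = ("Absent", "Consistent", "Complementary", "Divergent", "Contradictory")

# ASSUMPTION: which labels count as "clinically meaningful divergence"
# is not specified here; this set is a guess for illustration.
DIVERGENT_LABELS = {"Divergent", "Contradictory"}


def build_matrix(handbooks, pair_labels):
    """Return a symmetric label matrix (nested dict) for one question."""
    matrix = {a: {} for a in handbooks}
    for h in handbooks:
        matrix[h][h] = "Consistent"  # diagonal is Consistent by convention
    for a, b in combinations(handbooks, 2):
        label = pair_labels[frozenset((a, b))]
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        matrix[a][b] = label
        matrix[b][a] = label  # enforce symmetry
    return matrix


def divergence_rate(matrix):
    """Share of non-absent off-diagonal pairs carrying a divergent label."""
    counts = Counter(matrix[a][b] for a, b in combinations(list(matrix), 2))
    non_absent = sum(n for label, n in counts.items() if label != "Absent")
    if non_absent == 0:
        return None  # nothing to compare for this question
    divergent = sum(counts[label] for label in DIVERGENT_LABELS)
    return divergent / non_absent


# Hypothetical handbook identifiers in the organ-center_index-phase style.
handbooks = ["kidney-01-pre", "kidney-02-pre", "kidney-03-pre", "kidney-04-pre"]

# Toy labels, not results from the paper.
pair_labels = {
    frozenset(("kidney-01-pre", "kidney-02-pre")): "Consistent",
    frozenset(("kidney-01-pre", "kidney-03-pre")): "Complementary",
    frozenset(("kidney-01-pre", "kidney-04-pre")): "Absent",
    frozenset(("kidney-02-pre", "kidney-03-pre")): "Divergent",
    frozenset(("kidney-02-pre", "kidney-04-pre")): "Absent",
    frozenset(("kidney-03-pre", "kidney-04-pre")): "Contradictory",
}

matrix = build_matrix(handbooks, pair_labels)
rate = divergence_rate(matrix)
print(rate)
```

In this toy case four of the six pairs are non-absent and two of those carry a divergent label, so the rate is 0.5. Excluding Absent pairs from the denominator mirrors the "non-absent pairwise comparisons" phrasing in the abstract; coverage gaps would be tallied separately.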