Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models

Emma Croxford, Yanjun Gao, Nicholas Pellegrino, Karen K. Wong, Graham Wills, Elliot First, Miranda Schnier, Kyle Burton, Cris G. Ebby, Jillian Gorskic, Matthew Kalscheur, Samy Khalil, Marie Pisani, Tyler Rubeor, Peter Stetson, Frank Liao, Cherodeep Goswami, Brian Patterson, Majid Afshar

TL;DR

This study introduces the Provider Documentation Summarization Quality Instrument (PDSQI-9), an LLM-focused adaptation of the PDQI-9 designed to evaluate multi-document, AI-generated clinical summaries from real-world EHR data. Through a semi-Delphi process, nine attributes were defined to address hallucinations, omissions, relevancy, and stigmatizing language, with validation conducted using GPT-4o, Mixtral, and Llama models across 779 summaries. The instrument demonstrated strong internal consistency (Cronbach’s alpha ≈ 0.88) and high inter-rater reliability (ICC ≈ 0.867), with a four-factor structure explaining 58% of variance (organization, clarity, accuracy, utility) and evidence of substantive and discriminant validity. These findings support PDSQI-9 as a robust tool for safely evaluating LLM-generated clinical summaries and guiding integration of AI into healthcare workflows.

Abstract

As Large Language Models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation for substantive validity, factor analysis and Cronbach's alpha for structural validity, inter-rater reliability (ICC and Krippendorff's alpha) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Seven physician raters evaluated 779 summaries and answered 8,329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's alpha = 0.879; 95% CI: 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct ($\rho = -0.200$, $p = 0.029$) and Organized ($\rho = -0.190$, $p = 0.037$). Discriminant validity distinguished high- from low-quality summaries ($p < 0.001$). The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer integration of LLMs into healthcare workflows.
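To make the internal-consistency statistic concrete, the following is a minimal sketch of the standard Cronbach's alpha formula applied to a matrix of Likert scores (one row per evaluation, one column per attribute). It is not the authors' analysis code: the function name `cronbach_alpha` and the toy ratings are hypothetical and serve only as illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows, each row holding one score per item.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals)),
    where k is the number of items and variances are sample variances.
    """
    n_items = len(rows[0])
    if n_items < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two rows")
    # Sample variance of each item (column) across all evaluations.
    item_vars = [variance(col) for col in zip(*rows)]
    # Sample variance of the per-evaluation total score.
    total_var = variance([sum(row) for row in rows])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 1-5 Likert ratings on three attributes (illustrative data only).
toy_ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
print(f"alpha = {cronbach_alpha(toy_ratings):.3f}")
```

Values closer to 1 indicate that the attributes vary together across evaluations. The confidence interval reported in the abstract would need a separate procedure that this sketch does not cover.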

Paper Structure (15 sections, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Length of Patient Notes vs. Mean Evaluator Score. Scatter plots of the mean score among evaluators compared to the token length of the patient notes provided for summarization. Trend lines are superimposed on the plot along with the Spearman $\rho$ (denoted as R) coefficient and its p-value (denoted as p). Each plot corresponds to a single attribute from the PDSQI-9 instrument.
  • Figure 2: Length of Patient Notes vs. Standard Deviation of Evaluator Scores. Scatter plots of the score standard deviation among evaluators compared to the token length of the patient notes provided for summarization. Trend lines are superimposed on the plot along with the Spearman $\rho$ (denoted as R) coefficient and its p-value (denoted as p). Each plot corresponds to a single attribute from the PDSQI-9 instrument.
  • Figure 3: Likert Score Distributions by Attribute. A density ridge plot across the 5 Likert scale points that evaluators used to score each attribute from the PDSQI-9 instrument. Each attribute is shown separately and identified on the y-axis. Distributions include the scores from every evaluator on every unique patient summary reported in this study.
  • Figure 4: Time Spent per Evaluation by Evaluator Experience. A boxplot of the time, in minutes, that an evaluator took to complete each evaluation in full. Junior and senior physicians are shown separately for comparison.
  • Figure 5: PDSQI-9 Attribute Correlation Plot. A visual representation of the correlations between the scores for each attribute in the PDSQI-9. The strength of the correlation is represented by the color and size of each circle.
  • ...and 1 more figures
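
The captions for Figures 1 and 2 report a Spearman $\rho$ between note token length and evaluator scores. As a rough illustration of how such a coefficient is obtained, the sketch below ranks both variables (averaging tied ranks) and takes the Pearson correlation of the ranks. The helper names and the toy values are hypothetical, and the p-value shown in the figures is not computed here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical note lengths (tokens) and mean attribute scores, for illustration.
note_tokens = [1200, 3400, 800, 5600, 2500]
mean_scores = [4.3, 3.7, 4.0, 3.0, 4.5]
print(f"rho = {spearman_rho(note_tokens, mean_scores):.3f}")
```

A negative coefficient, as in the abstract's Succinct and Organized results, means that scores tend to fall as the source notes get longer.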