
AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding

Moiz Sadiq Awan, Maryam Raza

Abstract

Prior authorization remains one of the most burdensome administrative processes in U.S. healthcare, consuming billions of dollars and thousands of physician hours each year. While large language models have shown promise across clinical text tasks, their ability to produce submission-ready prior authorization letters has received only limited attention, with existing work confined to single-case demonstrations rather than structured multi-scenario evaluation. We assessed three commercially available LLMs (GPT-4o, Claude Sonnet 4.5, and Gemini 2.5 Pro) across 45 physician-validated synthetic scenarios spanning rheumatology, psychiatry, oncology, cardiology, and orthopedics. All three models generated letters with strong clinical content: accurate diagnoses, well-structured medical necessity arguments, and thorough step therapy documentation. However, a secondary analysis of real-world administrative requirements revealed consistent gaps that clinical scoring alone did not capture, including absent billing codes, missing authorization duration requests, and inadequate follow-up plans. These findings reframe the question: the challenge for clinical deployment is not whether LLMs can write clinically adequate letters, but whether the systems built around them can supply the administrative precision that payer workflows require.
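
The administrative gaps named above lend themselves to automated screening. As a minimal sketch of how such a secondary check could be implemented, the following Python snippet flags whether a generated letter appears to contain a billing code, an authorization duration request, and a follow-up plan. The element names and patterns are illustrative assumptions, not the instrument used in the paper's analysis.

```python
# A minimal sketch, assuming a keyword/pattern heuristic; the paper does not
# publish its checker. Element names and regexes are illustrative, not the
# authors' instrument.
import re

ADMIN_PATTERNS = {
    # CPT codes are five digits; ICD-10-CM codes look like "M05.79".
    "billing_code": re.compile(r"\b(\d{5}|[A-Z]\d{2}\.\d{1,4})\b"),
    # e.g. "authorization for 6 months", "duration of 12 weeks"
    "authorization_duration": re.compile(
        r"\b(authoriz\w+|duration)\b[^.]*\b\d+\s*(day|week|month|year)s?\b",
        re.IGNORECASE,
    ),
    # e.g. "follow-up in 3 months", "monitoring plan", "reassess response"
    "follow_up_plan": re.compile(
        r"\b(follow[- ]up|monitor\w*|reassess\w*)\b", re.IGNORECASE
    ),
}

def audit_letter(letter_text: str) -> dict[str, bool]:
    """Return which administrative elements a letter appears to contain."""
    return {name: bool(pat.search(letter_text)) for name, pat in ADMIN_PATTERNS.items()}
```

A checker of this kind would complement, not replace, clinical rubric scoring: it targets exactly the elements that Figure 5 shows are most often missing.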



Figures (5)

  • Figure 1: Mean rubric scores by criterion and model (scale: 0 to 2). C4 (denial anticipation) shows the greatest cross-model divergence. C1, C3, and C6 are at or near ceiling for all models evaluated.
  • Figure 2: C4 (denial anticipation) full credit rates stratified by PA challenge type. GPT-4o shows a marked deficit on step therapy challenges relative to Claude Sonnet 4.5 and Gemini 2.5 Pro.
  • Figure 3: Mean total score by model and clinical specialty ($n = 9$ per cell). All cells exceed 11.5/12. Perfect scores (12.00) are bolded.
  • Figure 4: Word-count distributions across the 45 letters per model. All pairwise differences are significant ($p < .001$, Bonferroni-corrected; a sketch of this comparison follows the list). Group means are shown in parentheses.
  • Figure 5: Prevalence of eight secondary PA letter elements by model. Billing codes and authorization duration represent the largest universal gaps. The dotted line marks 50% prevalence.
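
As referenced in the Figure 4 caption, the sketch below reconstructs the pairwise word-count comparison. It is a minimal illustration under stated assumptions: the paper does not name the specific test, so Welch's $t$-test stands in, and the variable names are hypothetical.

```python
# A minimal sketch of three pairwise word-count comparisons with a
# Bonferroni correction. Welch's t-test is an assumption; the paper
# reports only that pairwise differences are Bonferroni-corrected.
from itertools import combinations
from scipy.stats import ttest_ind

def pairwise_word_count_tests(word_counts: dict[str, list[int]]):
    """word_counts maps model name -> word counts of its 45 letters."""
    pairs = list(combinations(word_counts, 2))
    m = len(pairs)  # 3 comparisons for 3 models
    results = {}
    for a, b in pairs:
        stat, p = ttest_ind(word_counts[a], word_counts[b], equal_var=False)
        # Bonferroni: inflate each raw p-value by the number of comparisons.
        results[(a, b)] = (stat, min(p * m, 1.0))
    return results
```

With three models there are three pairwise comparisons, so each raw $p$-value is multiplied by 3 (capped at 1.0) before being compared against $\alpha = .05$.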