FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment

Betty Xiong, Jillian Fisher, Benjamin Newman, Meng Hu, Shivangi Gupta, Yejin Choi, Lanyan Fang, Russ B Altman

Abstract

We introduce an expert-curated, real-world benchmark for evaluating document-grounded question answering (QA), motivated by generic drug assessment and built on U.S. Food and Drug Administration (FDA) drug label documents. Drug labels contain rich but heterogeneous clinical and regulatory information, making accurate question answering difficult for current language models. In collaboration with FDA regulatory assessors, we introduce FDARxBench, construct a multi-stage pipeline for generating high-quality, expert-curated QA examples spanning factual, multi-hop, and refusal tasks, and design evaluation protocols to assess both open-book and closed-book reasoning. Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. While motivated by FDA generic drug assessment needs, this benchmark also provides a substantial foundation for challenging, regulatory-grade evaluation of label comprehension. The benchmark is designed to support evaluation of LLM behavior on drug-label questions.

Paper Structure (60 sections, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Example of expert-guided criteria.
  • Figure 2: Overview of FDARxBench creation.
  • Figure 3: Evidence access ablation (factual + multi-hop). Answer accuracy improves substantially when models are given oracle (gold) passages, but drops in the full-label setting with citation requirements, highlighting evidence selection/grounding as a key bottleneck.
  • Figure 4: Answer correctness vs. citation quality in full-label setting. Relationship between overall answer accuracy and citation overlap (micro-F1) in the full-label setting, showing that better answers do not always imply better citations.
  • Figure 5: Refusal behavior in full-label setting. Precision-recall tradeoff for refusal questions in the full-label setting; models vary in hallucination resistance (precision) despite uniformly high recall.
  • ...and 1 more figures
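
The captions for Figures 4 and 5 refer to citation overlap (micro-F1) and to refusal precision and recall. As a rough illustration only, the sketch below shows one common way such metrics are computed. It assumes citations are represented as sets of label-section identifiers and refusals as booleans. The function names, the data layout, and the example identifiers are hypothetical and are not taken from the paper, whose exact scoring protocol may differ.

```python
from typing import Iterable, List, Set, Tuple


def citation_micro_f1(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Micro-averaged F1 over (predicted, gold) citation sets.

    Counts are pooled across all questions before computing
    precision and recall, which is what "micro" averaging means.
    """
    tp = fp = fn = 0
    for predicted, gold in pairs:
        tp += len(predicted & gold)   # cited and correct
        fp += len(predicted - gold)   # cited but not in gold
        fn += len(gold - predicted)   # in gold but not cited
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def refusal_precision_recall(
    predicted: List[bool], gold: List[bool]
) -> Tuple[float, float]:
    """Precision and recall for refusals, treating "refuse" as the positive class."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical example: section identifiers are made up for illustration.
example_pairs = [
    ({"5.1", "7.2"}, {"5.1"}),
    ({"12.3"}, {"12.3", "2.1"}),
]
example_f1 = citation_micro_f1(example_pairs)

example_precision, example_recall = refusal_precision_recall(
    predicted=[True, True, False, True],
    gold=[True, False, False, True],
)
```

Under this formulation, a model can answer correctly while citing extra or missing sections, which lowers citation micro-F1 without affecting answer accuracy. Likewise, a model that refuses too readily keeps refusal recall high while precision falls.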