FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment

Betty Xiong; Jillian Fisher; Benjamin Newman; Meng Hu; Shivangi Gupta; Yejin Choi; Lanyan Fang; Russ B Altman

Abstract

We introduce FDARxBench, an expert-curated, real-world benchmark for evaluating document-grounded question answering (QA) motivated by generic drug assessment, built on U.S. Food and Drug Administration (FDA) drug label documents. Drug labels contain rich but heterogeneous clinical and regulatory information, making accurate question answering difficult for current language models. In collaboration with FDA regulatory assessors, we construct a multi-stage pipeline for generating high-quality, expert-curated QA examples spanning factual, multi-hop, and refusal tasks, and design evaluation protocols to assess both open-book and closed-book reasoning. Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. Although motivated by FDA generic drug assessment needs, the benchmark also provides a foundation for challenging, regulatory-grade evaluation of how LLMs comprehend drug labels.

Paper Structure (60 sections, 6 figures, 10 tables)

Introduction
Background and Related Work
FDA Drug Labels
Biomedical QA Datasets
Dataset: Expert-Guided FDA Label QA Benchmark
Source Documents and Preprocessing
Question Generation Pipeline
Question Types
Tasks
Metrics
Experiments and Results
Model Results
Retriever Results
Conclusion
Limitations
...and 45 more sections

Figures (6)

Figure 1: Example of expert-guided criteria.
Figure 2: Overview of FDARxBench creation.
Figure 3: Evidence access ablation (factual + multi-hop). Answer accuracy improves substantially when models are given oracle (gold) passages, but drops in the full-label setting with citation requirements, highlighting evidence selection/grounding as a key bottleneck.
Figure 4: Answer correctness vs. citation quality in full-label setting. Relationship between overall answer accuracy and citation overlap (micro-F1) in the full-label setting, showing that better answers do not always imply better citations.
Figure 5: Refusal behavior in full-label setting. Precision-recall tradeoff for refusal questions in the full-label setting; models vary in hallucination resistance (precision) despite uniformly high recall.
...and 1 more figure
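
The Figure 4 caption refers to citation overlap measured as micro-F1. As a rough illustration of what such a score could look like, the sketch below pools true positives, false positives, and false negatives over cited passage IDs across all questions before computing F1. The function name, the passage-ID format, and the choice to treat citations as sets are assumptions made for illustration; the benchmark's exact definition may differ.

```python
from typing import Iterable, Sequence, Set


def citation_micro_f1(
    predicted: Sequence[Iterable[str]],
    gold: Sequence[Iterable[str]],
) -> float:
    """Micro-averaged F1 over cited passage IDs, pooled across questions.

    Hypothetical helper: each entry holds the passage IDs cited (predicted)
    or annotated as evidence (gold) for one question.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have one entry per question")

    tp = fp = fn = 0
    for pred_ids, gold_ids in zip(predicted, gold):
        pred_set: Set[str] = set(pred_ids)
        gold_set: Set[str] = set(gold_ids)
        tp += len(pred_set & gold_set)   # cited and correct
        fp += len(pred_set - gold_set)   # cited but not in gold evidence
        fn += len(gold_set - pred_set)   # gold evidence that was not cited

    denom = 2 * tp + fp + fn
    # With no citations on either side, report 0.0 rather than divide by zero.
    return 2 * tp / denom if denom else 0.0


# Made-up passage IDs for two questions.
example_pred = [["s2.1", "s5.3"], ["s7"]]
example_gold = [["s2.1"], ["s7", "s8.2"]]
example_score = citation_micro_f1(example_pred, example_gold)
```

Because counts are pooled before the ratio is taken, questions with many gold passages weigh more heavily than they would under a per-question (macro) average, which is one reason answer accuracy and citation quality can diverge.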
