InformGen: An AI Copilot for Accurate and Compliant Clinical Research Consent Document Generation

Zifeng Wang, Junyi Gao, Benjamin Danek, Brandon Theodorou, Ruba Shaik, Shivashankar Thati, Seunghyun Won, Jimeng Sun

TL;DR

InformGen is an LLM-driven copilot for accurate and compliant ICF drafting that combines optimized knowledge document parsing and content generation with humans in the loop. Results demonstrate near 100% compliance with 18 core regulatory rules derived from FDA guidelines.

Abstract

Leveraging large language models (LLMs) to generate high-stakes documents, such as informed consent forms (ICFs), remains a significant challenge due to the extreme need for regulatory compliance and factual accuracy. Here, we present InformGen, an LLM-driven copilot for accurate and compliant ICF drafting by optimized knowledge document parsing and content generation, with humans in the loop. We further construct a benchmark dataset comprising protocols and ICFs from 900 clinical trials. Experimental results demonstrate that InformGen achieves near 100% compliance with 18 core regulatory rules derived from FDA guidelines, outperforming a vanilla GPT-4o model by up to 30%. Additionally, a user study with five annotators shows that InformGen, when integrated with manual intervention, attains over 90% factual accuracy, significantly surpassing the vanilla GPT-4o model's 57%-82%. Crucially, InformGen ensures traceability by providing inline citations to source protocols, enabling easy verification and maintaining the highest standards of factual integrity.

Paper Structure

This paper contains 12 sections, 6 figures.

Figures (6)

  • Figure 1: Overview of InformGen Workflow. a, The clinical trial protocol is parsed as the primary knowledge document. The raw PDF is converted to Markdown text, segmented into chunks with metadata on section titles and page numbers, and encoded into dense embeddings for storage in a vector database. Additionally, two specialized knowledge documents, the schedule of assessment (SOA) table and the procedure-risk table, are extracted from protocols and subjected to human review before use in content generation to ensure high factual accuracy. On the right, the FDA guidelines for ICF drafting are processed to create regulatory policies for compliance evaluation, while the site-specific consent template is parsed to provide structured instructions for content generation. b, InformGen generates ICF sections using a retrieval-augmented generation pipeline. It first formulates search queries with metadata filters, retrieves the most relevant chunks, and applies parsed instructions to generate the content. For sections requiring trial schedules or risk-related details, the SOA or procedure-risk table is incorporated as an additional input. c, The generated content is evaluated for regulatory compliance and factual accuracy. Section-specific rule sets, derived from FDA guidelines, are used to assess whether the content violates any compliance requirements. Additionally, key facts are extracted from human-written reference ICFs as ground truth and compared against the generated content to compute an accuracy score.
  • Figure 1: Rule set for evaluating compliance of generated content. "Section Title": The manually assigned section to which the rule applies and against which it is assessed. "Rule Name": The designated name of the rule. "Rule Description": A detailed explanation of the rule, derived from the FDA guideline document.
  • Figure 2: Dataset statistics. Characteristics of clinical trials and the associated protocols and informed consent forms used in the experiments. Year: the start year of the clinical trial; Region: the region in which the trial was primarily initiated and conducted; Enrollment: the actual number of subjects enrolled in the trial; Protocol pages: the number of pages in the clinical trial protocol; ICF pages: the number of pages in the human-written consent form document; Conditions: the primary conditions targeted by the trial.
  • Figure 3: Evaluating compliance of InformGen and baselines. a, Automatic evaluation of the compliance rate of the generated content by InformGen and the baseline RAG+Prompting, using GPT-4o as a judge. Both methods are evaluated in two variants, utilizing either GPT-4o or GPT-4o-mini as the underlying LLM. b, In-depth trial-level comparison of InformGen and RAG+Prompting, analyzing compliance rates across all four clinical trial phases. c, Confusion matrix comparing compliance judgments made by GPT-4o and human annotators for the generated ICF content. TPR: True Positive Rate; TNR: True Negative Rate. d, Manual evaluation and GPT-4o-based assessment of compliance for the generated content across 100 trials.
  • Figure 4: Evaluating the factual accuracy of InformGen and baselines. a, Automatic evaluation of the factual accuracy of content generated by InformGen and the baseline RAG+Prompting. Each ICF section is assessed against 3$\sim$5 key facts, with $n$ indicating the total number of evaluated facts. b, Trial-level breakdown of results by clinical trial phase. A side-by-side comparison of InformGen and RAG+Prompting for each trial shows that InformGen achieves a winning rate of 55%-80% in factual accuracy improvement. c, Confusion matrix comparing factual accuracy evaluations from GPT-4o (automatic evaluation) with human annotators. TPR: True Positive Rate; TNR: True Negative Rate. d, Analysis of factual accuracy across trials with varying protocol complexity. The x-axis represents the number of protocol pages, showing that InformGen exhibits greater robustness compared to RAG+Prompting as protocol complexity increases. e, Comparison of manual evaluation and GPT-4o-based assessment of factual accuracy across 100 trials.
  • ...and 1 more figure
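
The retrieval step described in the Figure 1 caption (chunks carrying section-title and page metadata, a metadata filter on the query, then similarity ranking) can be illustrated with a minimal sketch. This is not InformGen's implementation: the `Chunk` class, the `retrieve` function, the example chunk texts, and the bag-of-words `embed` function are all hypothetical, and `embed` is only a toy stand-in for the dense embedding model and vector database the caption mentions.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    text: str
    section_title: str  # metadata stored alongside each chunk
    page: int           # page number in the source protocol


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a dense embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    query: str,
    chunks: List[Chunk],
    section_filter: Optional[str] = None,
    top_k: int = 2,
) -> List[Chunk]:
    """Apply the metadata filter first, then rank the remaining chunks."""
    candidates = [
        c for c in chunks
        if section_filter is None or c.section_title == section_filter
    ]
    q = embed(query)
    ranked = sorted(candidates, key=lambda c: cosine(q, embed(c.text)), reverse=True)
    return ranked[:top_k]


# Hypothetical protocol chunks, invented for illustration only.
chunks = [
    Chunk("Participants receive the study drug once daily for 12 weeks", "Study Design", 4),
    Chunk("Blood draws may cause bruising or mild pain", "Risks", 17),
    Chunk("The study drug may cause nausea and headache", "Risks", 18),
]

hits = retrieve("risks of the study drug", chunks, section_filter="Risks", top_k=1)
for hit in hits:
    # Keeping the page number with each hit is what would allow an inline citation.
    print(f"{hit.text} [protocol p. {hit.page}]")
```

Carrying the section title and page number through retrieval is the part that matters for traceability: whatever text is generated from a retrieved chunk can point back to a specific location in the protocol.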