The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem

Joe B Hakim; Jeffery L Painter; Darmendra Ramcharran; Vijay Kara; Greg Powell; Paulina Sobczak; Chiho Sato; Andrew Bate; Andrew Beam

TL;DR

The guardrail framework offers a set of tools with broad applicability across various domains, ensuring LLMs can be safely used in high-risk situations by eliminating the occurrence of key errors, including the generation of incorrect pharmacovigilance-related terms, thus adhering to stringent regulatory and quality standards in medical safety-critical environments.

Abstract

Large language models (LLMs) are useful tools with the capacity for performing specific types of knowledge work at an effective scale. However, LLM deployments in high-risk and safety-critical domains pose unique challenges, notably the issue of "hallucination," where LLMs can generate fabricated information. This is particularly concerning in settings such as drug safety, where inaccuracies could lead to patient harm. To address these risks, we have developed and demonstrated a proof-of-concept suite of guardrails specifically designed to mitigate certain types of hallucinations and errors for drug safety, and potentially applicable to other medical safety-critical contexts. These guardrails include mechanisms to detect anomalous documents to prevent the ingestion of inappropriate data, identify incorrect drug names or adverse event terms, and convey uncertainty in generated content. We integrated these guardrails with an LLM fine-tuned for a text-to-text task, which involves converting both structured and unstructured data within adverse event reports into natural language. This method was applied to translate individual case safety reports, demonstrating effective application in a pharmacovigilance processing task. Our guardrail framework offers a set of tools with broad applicability across various domains, ensuring LLMs can be safely used in high-risk situations by eliminating the occurrence of key errors, including the generation of incorrect pharmacovigilance-related terms, thus adhering to stringent regulatory and quality standards in medical safety-critical environments.

Paper Structure (27 sections, 8 figures, 9 tables)

Introduction
Methods
Data Acquisition
Analysis of Individual Case Safety Reports
Development of a Multilingual Corpus for LLM Pretraining
Development of the ICSR translation LLM
Model fine-tuning and generation
Model evaluation
Expert human evaluation of the target text
Phase 1: Establishment of High-Quality Baseline Translations
Phase 2: Evaluation of LLM Translations Against Established Baseline
LLM guardrails for ICSR translations
Document-wise uncertainty quantification (DL-UQ)
MISMATCH (drug and AE mismatching)
Token-wise uncertainty quantification (TL-UQ)
...and 12 more sections

Figures (8)

Figure 1: Graphical summary of the large language model (LLM) workflow. We used extra structured fields and unstructured narrative texts from individual case safety reports (ICSRs), along with historical matched language examples, to fine-tune an LLM. We added a specific task prefix, and generated an English narrative from a Japanese ICSR, and finally checked this process via several guardrails: the document-level uncertainty, drug and adverse event matching, and token-level uncertainty guardrails (see Methods section).
Figure 2: The distribution of document-level uncertainty scores in extraneous, validation, and training samples. The vertical bar represents the minimum validation sample score that is greater than all the validation and training samples.
Figure 3: Illustration of guardrails filtering matched and unmatched drug terms and adverse event (AE) terms in the original Japanese ICSR and the LLM-produced English case report. Text spans in blue indicate AEs that were successfully matched between the two texts, while spans in yellow indicate AEs that were unmatched. Spans in green represent matched drugs, while spans in red represent unmatched drugs. Sensitive information has been redacted with black bars.
Figure 4: Counts of reviewer-identified drug error categories and mismatch guardrail fixes thereof. For each category, counts are given indicating which of the errors had been flagged.
Figure 5: Example flagged spans using the TL-UQ guardrail. Increasing levels of red highlighting correspond to increasing relative scores. Least color saturation: between the 10th and 5th percentile scores for the whole text. Medium color saturation: between the 5th and 1st percentile scores. Most color saturation: 1st percentile and above scores. Sensitive information has been redacted with black bars.
...and 3 more figures
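
The three highlight bands described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each generated token already has a numeric uncertainty score, that a higher score means more uncertainty, and that the percentiles are counted from the top of the text's score distribution. The function name `saturation_levels` and the labels "low", "medium", and "high" are hypothetical.

```python
from typing import List


def saturation_levels(scores: List[float]) -> List[str]:
    """Assign a highlight band to each token from its uncertainty score.

    Assumption: higher score = more uncertain, and bands are defined by
    the top 10%, top 5%, and top 1% of scores within the same text.
    """
    if not scores:
        return []

    ranked = sorted(scores, reverse=True)
    n = len(ranked)

    def cutoff(percent: int) -> float:
        # Score of the last token inside the top `percent` of the text
        # (always at least one token, so short texts still get a cutoff).
        k = max(1, n * percent // 100)
        return ranked[k - 1]

    top1, top5, top10 = cutoff(1), cutoff(5), cutoff(10)

    levels = []
    for s in scores:
        if s >= top1:
            levels.append("high")    # most saturated highlight
        elif s >= top5:
            levels.append("medium")  # medium saturation
        elif s >= top10:
            levels.append("low")     # least saturated highlight
        else:
            levels.append("none")    # not flagged
    return levels
```

Because the cutoffs are computed per text, the bands are relative: every sufficiently long text has some flagged tokens, so the highlighting directs reviewer attention rather than asserting that a flagged span is wrong.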
