Assessing biomedical knowledge robustness in large language models by query-efficient sampling attacks

R. Patrick Xian; Alex J. Lee; Satvik Lolla; Vincent Wang; Qiming Cui; Russell Ro; Reza Abbasi-Asl

TL;DR

This work probes the brittleness of biomedical domain knowledge in large language models by introducing type-consistent entity substitution (TCES) and a query-efficient embedding-space attack based on powerscaled distance-weighted sampling (PDWS). By evaluating both generalist and domain-specialist LLMs on entity-rich biomedical QA datasets (MedQA-USMLE and MedMCQA) with replacement entities drawn from FDA-drugs and CTD-diseases, the authors reveal a two-regime attack surface and show that PDWS can achieve strong attack success with limited queries, outperforming blackbox gradient approaches in many settings. They also demonstrate that adversarial entities can distort model explanations (Shapley values), underscoring risks to interpretability and trust in high-stakes biomedical applications. The findings inform robustness auditing and motivate defenses such as retrieval-augmented methods and targeted prompt strategies to mitigate brittle domain knowledge in LLMs.
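To make the idea of type-consistent entity substitution concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a multiple-choice question stored as a small dataclass and a lookup table mapping entity names to types; the names `MCQuestion`, `tces`, and `entity_types`, the placeholder question stem, and options A, B, and D are all hypothetical. Only the metoprolol to n-acetylglucosamine swap on choice C is taken from the Figure 5 caption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MCQuestion:
    """Hypothetical container for a multiple-choice question."""
    stem: str
    options: dict  # option label -> option text
    key: str       # label of the correct answer


def tces(question, target_label, replacement, entity_types):
    """Swap one distractor for another entity of the same type.

    Assumption: `entity_types` maps entity name -> type string
    (e.g. "drug" or "disease"). The key is never perturbed.
    """
    if target_label == question.key:
        raise ValueError("only distractors may be replaced, not the key")
    original = question.options[target_label]
    if original not in entity_types or replacement not in entity_types:
        raise ValueError("both entities need a known type")
    if entity_types[original] != entity_types[replacement]:
        raise ValueError("replacement must share the original entity's type")
    new_options = dict(question.options)  # copy, leave the input untouched
    new_options[target_label] = replacement
    return replace(question, options=new_options)


# Toy usage with an invented question; only the C-option swap mirrors Figure 5.
toy_types = {"metoprolol": "drug", "n-acetylglucosamine": "drug",
             "hypertension": "disease"}
toy_question = MCQuestion(
    stem="<question stem elided>",
    options={"A": "drug A", "B": "drug B", "C": "metoprolol", "D": "drug D"},
    key="B",
)
perturbed = tces(toy_question, "C", "n-acetylglucosamine", toy_types)
print(perturbed.options["C"])
```

The type check is what separates this from unconstrained word substitution: a drug name can only be replaced by another drug name, so the perturbed question stays well-formed while the key is left unchanged.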

Abstract

The increasing depth of parametric domain knowledge in large language models (LLMs) is fueling their rapid deployment in real-world applications. Understanding model vulnerabilities in high-stakes and knowledge-intensive tasks is essential for quantifying the trustworthiness of model predictions and regulating their use. The recent discovery of named entities as adversarial examples (i.e., adversarial entities) in natural language processing tasks raises questions about their potential impact on the knowledge robustness of pre-trained and fine-tuned LLMs in high-stakes and specialized domains. We examined the use of type-consistent entity substitution as a template for collecting adversarial entities for billion-parameter LLMs with biomedical knowledge. To this end, we developed an embedding-space attack based on powerscaled distance-weighted sampling to assess the robustness of their biomedical knowledge with a low query budget and controllable coverage. Our method has more favorable query efficiency and scaling than alternative approaches based on random sampling and blackbox gradient-guided search, which we demonstrated for adversarial distractor generation in biomedical question answering. Subsequent failure mode analysis uncovered two regimes of adversarial entities on the attack surface with distinct characteristics. We also showed that entity substitution attacks can manipulate token-wise Shapley value explanations, which become deceptive in this setting. Our approach complements standard evaluations for high-capacity models, and the results highlight the brittleness of domain knowledge in LLMs.
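The sampling step of the embedding-space attack can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: it assumes each candidate entity is drawn with probability proportional to $h^n$, where $h$ is its embedding distance from the key and $n$ is the power exponent, so that $n = 0$ reduces to random sampling, $n < 0$ favors entities near the key, and $n > 0$ favors entities far from it. Euclidean distance, sampling without replacement, and the function names `pdws_weights` and `pdws_sample` are likewise assumptions.

```python
import math
import random


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def _normalize(log_weights):
    # Subtract the max before exponentiating to avoid overflow for large |n|.
    m = max(log_weights)
    raw = [math.exp(lw - m) for lw in log_weights]
    total = sum(raw)
    return [r / total for r in raw]


def pdws_weights(key_vec, candidate_vecs, n, eps=1e-12):
    """Assumed form: p_i proportional to h_i ** n, h_i = distance to the key."""
    logs = [n * math.log(max(euclidean(key_vec, v), eps)) for v in candidate_vecs]
    return _normalize(logs)


def pdws_sample(key_vec, names, candidate_vecs, n, budget, rng=None, eps=1e-12):
    """Draw up to `budget` distinct candidates, one per model query."""
    if rng is None:
        rng = random.Random()
    logs = [n * math.log(max(euclidean(key_vec, v), eps)) for v in candidate_vecs]
    pool = list(range(len(names)))
    picks = []
    while pool and len(picks) < budget:
        # Renormalize over the entities that have not been drawn yet.
        weights = _normalize([logs[i] for i in pool])
        j = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        picks.append(names[pool.pop(j)])
    return picks


# Toy 2-D embeddings; real attacks would use a text embedding model.
key = (0.0, 0.0)
names = ["near", "mid", "far"]
vecs = [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
for n in (-2, 0, 2):
    print(n, [round(w, 3) for w in pdws_weights(key, vecs, n)])
```

Under this assumed form, a single exponent interpolates between nearest-neighbor-like and farthest-neighbor-like behavior, which is one way to read the "controllable coverage" mentioned in the abstract.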

Paper Structure (41 sections, 13 equations, 13 figures, 2 tables, 1 algorithm)

Introduction
Related works
Text substitution attacks
Robustness in QA
Characteristics of adversarial examples
Entity-centric adversarial attacks
Adversarial distractor generation
Powerscaled distance-weighted sampling
Sampling view of zeroth-order adversarial attack
Implementing attacks on biomedical QA
Dataset selection
Entity datasets
Biomedical QA datasets
Type-consistent entity substitution (TCES)
Attack execution on LLMs
...and 26 more sections

Figures (13)

Figure 1: Entity substitution attack on QA with adversarial distractors. (a) Typical query scaling curves in low- and high-query-budget attack settings. (b) An adversarial distractor example found by type-consistent entity substitution (highlighted in blue). The correct (top) and incorrect (bottom) model responses (checkmarked) before and after perturbation are included. (c) Illustration of the attack scheme in embedding space by PDWS for the example in (b). $\mathcal{E}$ represents the vocabulary set. D is the key to the question, A-C are the original distractors, and C$'$ is an adversarial distractor at distance $h$ from the key D.
Figure 2: Powerscaled DWS of adversarial distractors exhibits a two-regime effect at negative and positive $n$ values (see Eq. \ref{eq:prob}) in ASR (top) and diversity index (bottom) of replacement entities in successful attacks. Local maxima in ASR with a finite $n$ are also present in each regime. The vertical dashed line indicates the location of random sampling. The observed similar behaviors are compared across models and datasets in (a) Flan-T5-xxl on MedQA-USMLE, (b) Palmyra-Med-20B on MedQA-USMLE, (c) Flan-T5-xl on MedMCQA, (d) MedAlpaca-7B on MedMCQA. Disease and drug-mention questions are separated by colors. The average prompt semantic similarity displayed in (e)-(h) is calculated for the successful attacks obtained from the corresponding attack settings in (a)-(d), respectively.
Figure 3: Model robustness from single-query adversarial attacks using results in Table \ref{tab:onequery}. The models (GLMs and BLMs) were evaluated on drug-mention questions from MedQA-USMLE (left) and disease-mention questions from MedMCQA (right) datasets. The GLMs and BLMs are ordered horizontally by their sizes. The bar colors distinguish between different attack methods. Perturbations by random sampling (RS) are in purple. Perturbations by powerscaled DWS (PDWS) in the $n<0$ and $n>0$ regimes are in blue and grey, respectively.
Figure 4: Example scaling curves of the query budget against ASR for disease-mention questions with (a) Flan-T5-xl model on MedMCQA dataset and (b) MedAlpaca model on MedQA-USMLE dataset. The curves are generated using TCES of disease names with the methods described in the legends. The DiscreteZOO attacks were run in the low-query-budget setting. Executing the random sampling (RS) attack does not require a text embedding, while the other attacks were evaluated with CODER or GTE-base as the text embedding. In (c) and (d), query scaling of the attacks based on $B$-nearest and $B$-farthest element sampling ($B$ is the query budget) for the same datasets and models as in (a) and (b) is compared with PDWS.
Figure 5: (Top) An entity substitution attack using n-acetylglucosamine as the replacement entity for metoprolol creates an adversarial distractor. (Bottom) Heatmaps of token-wise Shapley values for a question before and after the adversarial attack on choice C. The model prediction changes from the correct choice of B (unperturbed) to the incorrect choice of D (adversarially perturbed).
...and 8 more figures
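The $B$-nearest and $B$-farthest baselines named in the Figure 4 caption, and the way an attack success rate (ASR) can be traced against the query budget $B$, can be sketched as below. This is a toy illustration under assumptions: ASR is taken here to be the fraction of questions for which at least one of the $B$ queried replacements flips the model's answer, and `is_fooled` is a hypothetical stand-in for querying the model; neither detail is confirmed by the text above.

```python
import math


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def b_nearest(key_vec, names, vecs, budget):
    """Deterministic baseline: the `budget` entities closest to the key."""
    order = sorted(range(len(names)), key=lambda i: distance(key_vec, vecs[i]))
    return [names[i] for i in order[:budget]]


def b_farthest(key_vec, names, vecs, budget):
    """Deterministic baseline: the `budget` entities farthest from the key."""
    order = sorted(range(len(names)),
                   key=lambda i: distance(key_vec, vecs[i]), reverse=True)
    return [names[i] for i in order[:budget]]


def attack_success_rate(questions, select, is_fooled, budget):
    """Assumed ASR: share of questions broken by any of the `budget` queries.

    `select(question, budget)` returns candidate replacements;
    `is_fooled(question, candidate)` is a hypothetical model query.
    """
    hits = 0
    for q in questions:
        if any(is_fooled(q, c) for c in select(q, budget)):
            hits += 1
    return hits / len(questions)


# Toy data: two questions sharing one small entity vocabulary.
names = ["near", "mid", "far"]
vecs = [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
questions = [
    {"key_vec": (0.0, 0.0), "fooled_by": {"far"}},
    {"key_vec": (0.0, 0.0), "fooled_by": {"near"}},
]


def fooled(q, c):
    return c in q["fooled_by"]


for budget in (1, 2, 3):
    asr = attack_success_rate(
        questions, lambda q, b: b_nearest(q["key_vec"], names, vecs, b),
        fooled, budget)
    print(budget, asr)
```

Swapping `b_nearest` for `b_farthest`, or for a stochastic sampler, in the `select` argument gives the other curves in such a comparison without changing the evaluation loop.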

Theorems & Definitions (4)

Definition 3.1
Definition 3.2: Adversarial distractor
Example 3.1
Definition 3.3: Text span representations
