Assessing biomedical knowledge robustness in large language models by query-efficient sampling attacks
R. Patrick Xian, Alex J. Lee, Satvik Lolla, Vincent Wang, Qiming Cui, Russell Ro, Reza Abbasi-Asl
TL;DR
This work probes the brittleness of biomedical domain knowledge in large language models by introducing type-consistent entity substitution (TCES) and a query-efficient embedding-space attack based on powerscaled distance-weighted sampling (PDWS). By evaluating both generalist and domain-specialist LLMs on entity-rich biomedical QA datasets (MedQA-USMLE and MedMCQA) with perturbations drawn from FDA-drugs and CTD-diseases, the authors identify two regimes of adversarial entities on the attack surface and show that PDWS can achieve strong attack success with limited queries, outperforming blackbox gradient approaches in many settings. They also demonstrate that adversarial entities can distort model explanations (Shapley values), underscoring risks to interpretability and trust in high-stakes biomedical applications. The findings inform robustness auditing and motivate defenses such as retrieval-augmented methods and targeted prompt strategies to mitigate brittle domain knowledge in LLMs.
Abstract
The increasing depth of parametric domain knowledge in large language models (LLMs) is fueling their rapid deployment in real-world applications. Understanding model vulnerabilities in high-stakes and knowledge-intensive tasks is essential for quantifying the trustworthiness of model predictions and regulating their use. The recent discovery of named entities as adversarial examples (i.e., adversarial entities) in natural language processing tasks raises questions about their potential impact on the knowledge robustness of pre-trained and finetuned LLMs in high-stakes and specialized domains. We examined the use of type-consistent entity substitution as a template for collecting adversarial entities for billion-parameter LLMs with biomedical knowledge. To this end, we developed an embedding-space attack based on powerscaled distance-weighted sampling to assess the robustness of their biomedical knowledge with a low query budget and controllable coverage. Our method has favorable query efficiency and scaling over alternative approaches based on random sampling and blackbox gradient-guided search, which we demonstrated for adversarial distractor generation in biomedical question answering. Subsequent failure mode analysis uncovered two regimes of adversarial entities on the attack surface with distinct characteristics. We also showed that entity substitution attacks can manipulate token-wise Shapley value explanations, which become deceptive in this setting. Our approach complements standard evaluations for high-capacity models, and the results highlight the brittleness of domain knowledge in LLMs.
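To make the sampling idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact formulation, so the sketch assumes that each candidate entity is drawn with probability proportional to its Euclidean embedding distance from a reference entity raised to a tunable power, that candidates are drawn without replacement up to a query budget, and that the attack replaces one distractor in a multiple-choice question. The names `pdws_probabilities`, `sample_candidates`, `run_attack`, and the `predict` callable are hypothetical.

```python
import numpy as np


def pdws_probabilities(ref_vec, cand_vecs, power=1.0, eps=1e-12):
    """Sampling probabilities proportional to distance**power (assumed form)."""
    dists = np.linalg.norm(cand_vecs - ref_vec, axis=1)
    # eps keeps a zero distance from producing a division by zero
    # when the power is negative.
    weights = np.power(dists + eps, power)
    return weights / weights.sum()


def sample_candidates(ref_vec, cand_vecs, budget, power=1.0, seed=0):
    """Draw up to `budget` distinct candidate indices under the weights above."""
    rng = np.random.default_rng(seed)
    probs = pdws_probabilities(ref_vec, cand_vecs, power)
    size = min(budget, len(cand_vecs))
    return rng.choice(len(cand_vecs), size=size, replace=False, p=probs)


def run_attack(predict, question, options, answer_key, distractor_key,
               names, cand_vecs, ref_vec, budget, power=1.0, seed=0):
    """Swap one distractor for sampled same-type entities until the answer flips.

    `predict(question, options)` is a hypothetical callable that returns the
    option key chosen by the model under test.
    Returns (adversarial entity or None, number of queries used).
    """
    order = sample_candidates(ref_vec, cand_vecs, budget, power, seed)
    for n_queries, i in enumerate(order, start=1):
        perturbed = dict(options)
        perturbed[distractor_key] = names[i]  # type-consistent substitution
        if predict(question, perturbed) != answer_key:
            return names[i], n_queries
    return None, len(order)
```

Under this assumed form, a positive power favors entities far from the reference in embedding space, a negative power favors near neighbors, and a power of zero reduces to uniform random sampling, which gives one knob for controlling coverage at a fixed query budget.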
