Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Jack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Leo Anthony Celi, Hugo Aerts, Thomas Hartvigsen, Danielle Bitterman

TL;DR

This work investigates how substituting brand and generic drug names affects biomedical language model performance. It introduces RABBITS, a robustness benchmark created by expert-guided drug-name swaps in MedQA and MedMCQA, and assesses both open-source and API models in a zero-shot setting. The study finds consistent performance degradations linked to drug-name interchangeability, with larger models showing larger drops and API models generally more robust than open models. A key insight is that substantial contamination of test data in pretraining datasets likely drives observed fragility, underscoring the need for contamination-aware evaluation and domain-specific robustness tests in biomedical NLP. Overall, RABBITS provides a focused framework and leaderboard to quantify and address nomenclature-related brittleness in clinical QA benchmarks, with implications for reliability in patient-facing AI tools.

Abstract

Medical knowledge is context-dependent and requires consistent reasoning across various natural language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop ranging from 1-10%. Furthermore, we identify a potential source of this fragility as the contamination of test data in widely used pre-training datasets. All code is accessible at https://github.com/BittermanLab/RABBITS, and a HuggingFace leaderboard is available at https://huggingface.co/spaces/AIM-Harvard/rabbits-leaderboard.
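
The swap-and-compare idea in the abstract can be illustrated with a minimal sketch. This is not the RABBITS pipeline: the two-entry mapping and the function names swap_drug_names and accuracy_drop are hypothetical, chosen only for illustration, and the real dataset relies on physician expert annotations rather than plain string matching.

```python
import re

# Hypothetical generic -> brand pairs for illustration only; the actual
# RABBITS mapping comes from physician expert annotations.
GENERIC_TO_BRAND = {
    "ibuprofen": "Advil",
    "acetaminophen": "Tylenol",
}


def swap_drug_names(text, mapping):
    """Replace whole-word, case-insensitive matches of each key with its value."""
    # Try longer names first so a multi-word name wins over a shorter one
    # that happens to be its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in mapping.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


def accuracy_drop(original_correct, swapped_correct):
    """Accuracy on the original items minus accuracy on the swapped items,
    in percentage points. Both arguments are per-question booleans."""
    n = len(original_correct)
    if n == 0 or n != len(swapped_correct):
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * (sum(original_correct) - sum(swapped_correct)) / n


question = "A patient takes ibuprofen and Acetaminophen daily."
swapped_question = swap_drug_names(question, GENERIC_TO_BRAND)
drop = accuracy_drop([True, True, True, False], [True, False, True, False])
```

A positive value from accuracy_drop means the model did worse after the swap. Whole-word matching is a deliberate simplification here; it avoids rewriting substrings of longer words but cannot judge whether a swap is clinically appropriate in context, which is the role the expert annotations play in the paper.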

Paper Structure (19 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: RABBITS dataset generation workflow.
  • Figure 2: Performance of models on the filtered original datasets compared to the generic-to-brand versions. The dashed diagonal line represents the ideal scenario where synonym swaps do not affect model performance.
  • Figure 3: Performance of models on multi-choice question identification of brand-generic drug pairs ordered in increasing model size. Gemini results are missing due to Google's API safety filters.
  • Figure 4: Performance of models on the filtered original datasets compared to the generic-to-brand versions for MedMCQA and MedQA subsets. Negative values indicate worse performance on the swapped dataset.