Evaluating the Robustness of Adverse Drug Event Classification Models Using Templates

Dorothea MacPhail, David Harbecke, Lisa Raithel, Sebastian Möller

TL;DR

This paper addresses the evaluation gap in adverse drug event (ADE) classification by introducing a template-based, behavior-focused testing framework that probes four capabilities: temporal order, negation, sentiment, and beneficial effects. By fine-tuning two transformer models on a merged social media ADE dataset and evaluating them with a bank of 99 templates (1,505 variations) and 11,265 test cases, the authors demonstrate that similar held-out F1 scores can mask substantive differences in capability robustness. The results reveal that models struggle to distinguish ADEs from beneficial effects and to handle temporal and negation variations robustly, even when overall performance appears strong. The work provides a practical, reproducible template bank for deeper ADE model evaluation and suggests directions for broader capability testing and data augmentation to improve real-world reliability.

Abstract

An adverse drug effect (ADE) is any harmful event resulting from medical drug treatment. Despite their importance, ADEs are often under-reported in official channels. Some research has therefore turned to detecting discussions of ADEs in social media. Impressive results have been achieved in various attempts to detect ADEs. In a high-stakes domain such as medicine, however, an in-depth evaluation of a model's abilities is crucial. We address the issue of thorough performance evaluation in English-language ADE detection with hand-crafted templates for four capabilities: temporal order, negation, sentiment, and beneficial effect. We find that models with similar performance on held-out test sets have varying results on these capabilities.
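
To make the idea of template-based capability testing concrete, the following is a minimal sketch of how hand-crafted templates might be expanded into labelled test cases and scored per capability. It is not the authors' implementation: the template wording, the placeholder drug names, the effect list, the labels (1 = ADE, 0 = no ADE), and the helper names `expand` and `accuracy_by_capability` are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical templates, one per capability. The wording and the labels
# (1 = ADE, 0 = no ADE) are illustrative assumptions, not the paper's templates.
TEMPLATES = {
    "temporal_order": ("I had {effect} long before I started taking {drug}.", 0),
    "negation": ("I took {drug} and did not get {effect}.", 0),
    "positive_sentiment": ("I love {drug}, even though it gave me {effect}.", 1),
    "beneficial_effect": ("{drug} finally got rid of my {effect}.", 0),
}

# Hypothetical slot fillers (placeholder drug names, generic symptoms).
FILLERS = {
    "drug": ["DrugA", "DrugB"],
    "effect": ["nausea", "insomnia", "dizziness"],
}


def expand(templates, fillers):
    """Fill every template with every combination of slot values."""
    keys = sorted(fillers)
    cases = []
    for capability, (template, label) in templates.items():
        for values in product(*(fillers[k] for k in keys)):
            text = template.format(**dict(zip(keys, values)))
            cases.append({"capability": capability, "text": text, "label": label})
    return cases


def accuracy_by_capability(cases, predict):
    """Share of correctly classified test cases for each capability."""
    correct, total = {}, {}
    for case in cases:
        cap = case["capability"]
        total[cap] = total.get(cap, 0) + 1
        correct[cap] = correct.get(cap, 0) + int(predict(case["text"]) == case["label"])
    return {cap: correct[cap] / total[cap] for cap in total}


cases = expand(TEMPLATES, FILLERS)

# Stand-in for a fine-tuned classifier: always predicts "ADE".
scores = accuracy_by_capability(cases, lambda text: 1)
```

A trivial always-ADE predictor passes every case whose expected label is ADE and fails all others, which is the kind of per-capability breakdown that a single held-out F1 score would hide. In practice, `predict` would wrap the fine-tuned model under test.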

Paper Structure (30 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: The process for creating test cases.
  • Figure 2: The different data sources for creating the custom dataset for fine-tuning the models.
  • Figure 3: Per-class performance of fine-tuned BioRedditBERT (left) and XLM-RoBERTa (right) on the test set (grey box baseline) and the capability-based test cases. The three distinct types of Temporal Order tests refer to the varieties of Temporal Order templates (standard, single and double time entity) highlighted in Table \ref{tab:testcase-examples}. Most test cases are more difficult for the models to solve than the samples from the custom dataset. The biggest difference between the models is the performance on the negation test cases with ADE label, where BioRedditBERT solves 20% more test cases than XLM-RoBERTa. Furthermore, the two models perform differently on Temporal Order test cases, especially standard cases with ADE label.
  • Figure 4: Performance of XLM-RoBERTa on test cases by drug name and by capability. The number of test cases per capability and drug name is 1440 (Temporal Order), 540 (Positive Sentiment), 48 (Beneficial Effect), 225 (Negation).
  • Figure 5: Results of the CheckList tests on the fine-tuned BioRedditBERT by template. The ratio of correctly classified test cases per template is shown on the horizontal axis. Each plot is a histogram counting how many templates fall at each ratio of correctly classified test cases.
  • ...and 1 more figures
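
The per-drug counts in the Figure 4 caption can be checked against the 11,265 test cases reported in the TL;DR. The short sketch below does the arithmetic; the conclusion that the test cases cover five drug names is an inference from these numbers, not something stated in the text above.

```python
# Test cases per capability for a single drug name (from the Figure 4 caption).
per_drug = {
    "Temporal Order": 1440,
    "Positive Sentiment": 540,
    "Beneficial Effect": 48,
    "Negation": 225,
}

total_test_cases = 11265  # total reported in the TL;DR

per_drug_total = sum(per_drug.values())  # 2253
quotient, remainder = divmod(total_test_cases, per_drug_total)

print(per_drug_total, quotient, remainder)  # prints: 2253 5 0
```

The total divides evenly, which is consistent with five drug names each receiving the same 2,253 test cases.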