Evaluating the Robustness of Adverse Drug Event Classification Models Using Templates
Dorothea MacPhail, David Harbecke, Lisa Raithel, Sebastian Möller
TL;DR
This paper addresses the evaluation gap in adverse drug event (ADE) classification by introducing a template-based, behavior-focused testing framework that probes four capabilities: temporal order, negation, sentiment, and beneficial effects. By fine-tuning two transformer models on a merged social media ADE dataset and evaluating them with a bank of 99 templates (1,505 variations) and 11,265 test cases, the authors demonstrate that similar held-out F1 scores can mask substantive differences in capability robustness. The results reveal that models struggle to distinguish ADEs from beneficial effects and to handle temporal and negation variations robustly, even when overall performance appears strong. The work provides a practical, reproducible template bank for deeper ADE model evaluation and suggests directions for broader capability testing and data augmentation to improve real-world reliability.
Abstract
An adverse drug effect (ADE) is any harmful event resulting from medical drug treatment. Despite their importance, ADEs are often under-reported in official channels. Some research has therefore turned to detecting discussions of ADEs in social media. Impressive results have been achieved in various attempts to detect ADEs. In a high-stakes domain such as medicine, however, an in-depth evaluation of a model's abilities is crucial. We address the issue of thorough performance evaluation in English-language ADE detection with hand-crafted templates for four capabilities: temporal order, negation, sentiment, and beneficial effect. We find that models with similar performance on held-out test sets have varying results on these capabilities.
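To make the idea of template-based capability testing concrete, the following is a minimal sketch of how one hand-crafted template could be expanded into test cases and scored. It is not the authors' implementation: the template wording, the filler lists, the label convention (0 = no ADE), and the helper names `expand_template` and `failure_rate` are all hypothetical and chosen only for illustration.

```python
from itertools import product
from string import Formatter


def expand_template(template, fillers):
    """Return one filled-in sentence per combination of filler values."""
    # Collect the slot names in order of first appearance, without duplicates.
    slots = [name for _, name, _, _ in Formatter().parse(template) if name]
    slots = list(dict.fromkeys(slots))
    combos = product(*(fillers[slot] for slot in slots))
    return [template.format(**dict(zip(slots, combo))) for combo in combos]


def failure_rate(predict, cases):
    """Fraction of (text, expected_label) cases that predict gets wrong."""
    failures = sum(1 for text, expected in cases if predict(text) != expected)
    return failures / len(cases)


# Hypothetical negation template: the ADE is explicitly denied,
# so the expected label is assumed to be 0 ("no ADE").
template = "I have been taking {drug} for a week and I did not get {ade}."
fillers = {
    "drug": ["ibuprofen", "sertraline"],
    "ade": ["a headache", "nausea", "a rash"],
}
test_cases = [(text, 0) for text in expand_template(template, fillers)]


def keyword_baseline(text):
    """Toy stand-in for a classifier: flags any mention of a symptom term."""
    return int(any(term in text for term in ("headache", "nausea", "rash")))


if __name__ == "__main__":
    print(len(test_cases), "test cases")
    print("failure rate:", failure_rate(keyword_baseline, test_cases))
```

In this sketch a keyword-matching baseline fails every negation case, which illustrates how a capability-specific failure rate can expose a weakness that an aggregate held-out score would not show. A fine-tuned model would be plugged in by replacing `keyword_baseline` with its prediction function.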
