Can ensembles improve evidence recall? A case study
Katharina Beckh, Sven Heuser, Stefan Rüping
TL;DR
This paper addresses the challenge of recovering complete evidence for model predictions in regulated settings by testing ensembles of transformer-based models. It compares unsupervised input-gradient regularization (IGR) with supervised Evidence-Guided Training (EGT) on the MDACE medical-coding task, evaluating how aggregating evidence across models affects recall and precision. The key finding is that aggregating evidence across an ensemble substantially increases recall (up to 0.87) relative to the best single model, at the cost of more false positives, while EGT generally yields higher precision than IGR. The work highlights practical implications for compliance and cataloging tasks and emphasizes the need for larger, fully annotated datasets to support exhaustive evidence extraction.
Abstract
Feature attribution methods typically provide minimal sufficient evidence justifying a model decision. However, many applications, such as compliance and cataloging, require complete evidence: the full set of features that contribute to a decision. We present a case study using existing language models and a medical dataset that contains human-annotated complete evidence. Our findings show that an ensemble approach, which aggregates evidence from several models, improves evidence recall over individual models. We examine different ensemble sizes and the effect of evidence-guided training, and we provide qualitative insights.
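To make the recall/precision trade-off of evidence aggregation concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each model's evidence is a set of token indices and that the ensemble aggregates by set union; the abstract does not specify the aggregation rule, so the union, the function names `aggregate_evidence` and `precision_recall`, and the toy data are all illustrative assumptions.

```python
from typing import Iterable, Set, Tuple


def aggregate_evidence(per_model: Iterable[Set[int]]) -> Set[int]:
    """Combine evidence from several models.

    Assumption: aggregation is a plain set union of token indices.
    """
    combined: Set[int] = set()
    for evidence in per_model:
        combined |= evidence
    return combined


def precision_recall(predicted: Set[int], gold: Set[int]) -> Tuple[float, float]:
    """Token-level precision and recall against annotated complete evidence."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: token indices annotated as complete evidence,
# and the evidence each of three models marks for the same prediction.
gold = {2, 3, 7, 8}
model_a = {2, 3}
model_b = {7, 9}
model_c = {3, 8, 11}

ensemble = aggregate_evidence([model_a, model_b, model_c])
single_p, single_r = precision_recall(model_a, gold)
ensemble_p, ensemble_r = precision_recall(ensemble, gold)

print(f"model_a:  precision={single_p:.2f} recall={single_r:.2f}")
print(f"ensemble: precision={ensemble_p:.2f} recall={ensemble_r:.2f}")
```

In this toy case the union recovers every annotated token, so recall rises from 0.50 to 1.00, while the two tokens outside the annotation lower precision from 1.00 to 0.67. A stricter rule, such as keeping only tokens marked by at least two models, would trade some of that recall back for precision.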
