Can ensembles improve evidence recall? A case study

Katharina Beckh, Sven Heuser, Stefan Rüping

TL;DR

This paper addresses the challenge of recovering complete evidence for model predictions in regulated settings by testing ensembles of transformer-based models. It compares unsupervised input-gradient regularization (IGR) with supervised Evidence-Guided Training (EGT) on the MDACE medical-coding task, evaluating how aggregating evidence across models impacts recall and precision. The key finding is that ensemble evidence aggregation substantially increases recall (up to 0.87) compared to the best single model, though it increases false positives, while EGT generally yields higher precision than IGR. The work highlights practical implications for compliance and cataloging tasks and emphasizes the need for larger, fully annotated datasets to support exhaustive evidence extraction.

Abstract

Feature attribution methods typically provide minimal sufficient evidence justifying a model decision. However, in many applications, such as compliance and cataloging, the full set of contributing features must be identified: complete evidence. We present a case study using existing language models and a medical dataset that contains human-annotated complete evidence. Our findings show that an ensemble approach, aggregating evidence from several models, improves evidence recall over individual models. We examine different ensemble sizes and the effect of evidence-guided training, and we provide qualitative insights.
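
As a rough illustration of the aggregation idea, the sketch below takes the union of the token positions that each model marks as evidence and scores the result against a human-annotated set. The union rule, the function names, and the toy token indices are assumptions made for illustration; the abstract does not specify how evidence is aggregated.

```python
def aggregate_evidence(model_evidence):
    """Union of evidence token indices across models (assumed aggregation rule)."""
    combined = set()
    for evidence in model_evidence:
        combined |= set(evidence)
    return combined


def recall_precision(predicted, gold):
    """Token-level recall and precision of predicted evidence against gold evidence."""
    true_positives = len(predicted & gold)
    recall = true_positives / len(gold) if gold else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision


# Toy data (hypothetical): token indices marked as evidence.
gold = {2, 3, 7, 8, 15}   # human-annotated complete evidence
model_a = {2, 3}          # evidence attributed by one model
model_b = {7, 8, 20}      # evidence attributed by a second model

single_recall, single_precision = recall_precision(model_a, gold)
ensemble = aggregate_evidence([model_a, model_b])
ensemble_recall, ensemble_precision = recall_precision(ensemble, gold)
```

In this toy case the union covers more of the annotated evidence than `model_a` alone, but it also admits token 20, which is not annotated. That mirrors the trade-off described in the TL;DR: higher recall at the cost of more false positives.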

Paper Structure

This paper contains 12 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Illustration of complete evidence extraction on the medical coding task. An ensemble approach, aggregating the evidence of several models, leads to higher recall and, thus, more complete evidence.
  • Figure 2: Recall for different ensemble sizes including all possible model combinations. Red diamond indicates mean value, whiskers show minimum and maximum values.
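
The kind of summary Figure 2 describes (mean, minimum, and maximum recall over all model combinations of a given size) could be computed along the following lines. This is a minimal sketch under stated assumptions: evidence is represented as sets of token indices, ensembles are formed by union, and the helper names and toy data are hypothetical rather than taken from the paper.

```python
from itertools import combinations
from statistics import mean


def recall(predicted, gold):
    """Fraction of gold evidence tokens that were recovered."""
    return len(predicted & gold) / len(gold) if gold else 0.0


def recall_by_ensemble_size(model_evidence, gold):
    """For each ensemble size, summarise recall over all model combinations."""
    summary = {}
    for size in range(1, len(model_evidence) + 1):
        recalls = []
        for combo in combinations(model_evidence, size):
            union = set().union(*combo)  # assumed aggregation: set union
            recalls.append(recall(union, gold))
        summary[size] = {
            "mean": mean(recalls),
            "min": min(recalls),
            "max": max(recalls),
        }
    return summary


# Toy data (hypothetical): three models, four annotated evidence tokens.
gold = {1, 2, 3, 4}
models = [{1}, {1, 2}, {3, 9}]
summary = recall_by_ensemble_size(models, gold)
```

Because a union can only add tokens, the minimum recall at each size never drops below the minimum at the previous size in this setup.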