A Causal Framework for Evaluating Deferring Systems

Filippo Palomba; Andrea Pugnana; José Manuel Alvarez; Salvatore Ruggieri

TL;DR

The paper evaluates deferring systems through a causal lens, moving beyond final accuracy to quantify the causal impact of deferring on predictive performance. It develops a formal mapping between the components of a deferring system and the potential outcomes framework, and presents two estimation routes: Scenario 1, with full access to ML and human predictions for deferred instances, and Scenario 2, which uses a regression discontinuity (RD) design to identify local effects at the deferral boundary. It introduces estimands such as the average treatment effect on the deferred ($\tau_{ATD}$) and the RD-based local effect ($\tau_{RD}$), and shows how to estimate them with difference-in-means and local polynomial regression, respectively. Through experiments on a synthetic dataset and four real learning-to-defer (LtD) datasets across seven deferring systems, the framework reveals when deferring to humans improves accuracy, exposes conditional effects (e.g., by gender), and highlights practical considerations for deploying and auditing human–AI teams.
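As a rough illustration of the difference-in-means route for Scenario 1, the sketch below computes an estimate of $\tau_{ATD}$ from per-instance correctness. It assumes accuracy is measured as 0/1 correctness and that both predictions are recorded for every deferred instance; the function name `estimate_atd` and its argument names are hypothetical, not taken from the paper.

```python
import numpy as np

def estimate_atd(y_true, ml_pred, human_pred, deferred):
    """Difference-in-means estimate of the effect of deferring, on deferred instances.

    Assumption: the outcome is 0/1 correctness, and both the ML and the human
    prediction are available for every deferred instance (Scenario 1).
    All names here are illustrative.
    """
    y_true = np.asarray(y_true)
    ml_pred = np.asarray(ml_pred)
    human_pred = np.asarray(human_pred)
    deferred = np.asarray(deferred, dtype=bool)
    if not deferred.any():
        raise ValueError("no deferred instances: the effect is undefined")

    # Correctness under each "treatment", restricted to deferred instances.
    human_correct = (human_pred[deferred] == y_true[deferred]).astype(float)
    ml_correct = (ml_pred[deferred] == y_true[deferred]).astype(float)

    # Individual effects are observable here because both outcomes are known.
    effects = human_correct - ml_correct
    tau_hat = effects.mean()

    # Standard error of the mean of the paired differences.
    n = effects.size
    std_err = effects.std(ddof=1) / np.sqrt(n) if n > 1 else float("nan")
    return tau_hat, std_err

# Toy data: three of the six instances are deferred.
y_true = [1, 0, 1, 1, 0, 1]
ml_pred = [1, 0, 0, 0, 0, 1]
human_pred = [1, 1, 1, 1, 0, 0]
deferred = [False, False, True, True, True, False]
tau_hat, std_err = estimate_atd(y_true, ml_pred, human_pred, deferred)
```

A positive `tau_hat` indicates that, on the deferred instances, the human was more accurate than the ML model would have been. Restricting the same computation to a subgroup (for example, one value of a protected attribute) gives a conditional version of the estimate.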

Abstract

Deferring systems extend supervised Machine Learning (ML) models with the possibility of deferring predictions to human experts. However, evaluating the impact of a deferring strategy on system accuracy is still an overlooked area. This paper fills this gap by evaluating deferring systems through a causal lens. We link the potential outcomes framework for causal inference with deferring systems, which allows us to identify the causal impact of the deferring strategy on predictive accuracy. We distinguish two scenarios. In the first one, we have access to both the human and ML model predictions for the deferred instances. Here, we can identify the individual causal effects for deferred instances and their aggregates. In the second one, only human predictions are available for the deferred instances. Here, we can resort to regression discontinuity designs to estimate a local causal effect. We evaluate our approach on synthetic and real datasets for seven deferring systems from the literature.
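For the second scenario, a regression discontinuity estimate can be sketched as the gap between two local linear fits evaluated at the deferral cutoff. The sketch below is a minimal version under stated assumptions: instances are deferred when a scalar score is at or above the cutoff, the outcome is the observed correctness of whoever made the prediction, and the bandwidth is fixed by hand. The function name `local_linear_rd`, the triangular kernel, and the toy data are illustrative choices, not the paper's implementation; a real analysis would add data-driven bandwidth selection and robust inference.

```python
import numpy as np

def local_linear_rd(score, outcome, cutoff, bandwidth):
    """Local linear RD estimate: jump in the outcome at the cutoff.

    Assumptions (illustrative): instances with score >= cutoff are deferred
    to the human, the rest are predicted by the ML model; `outcome` is the
    observed 0/1 correctness (or any numeric outcome) of the system.
    """
    x = np.asarray(score, dtype=float) - cutoff  # centred running variable
    y = np.asarray(outcome, dtype=float)

    def intercept_at_cutoff(side):
        # Triangular kernel weights: linear decay to zero at the bandwidth.
        w = np.clip(1.0 - np.abs(x[side]) / bandwidth, 0.0, None)
        keep = w > 0
        if keep.sum() < 2:
            raise ValueError("too few observations within the bandwidth")
        xs, ys, ws = x[side][keep], y[side][keep], w[keep]
        # Weighted least squares for y = a + b * x, via sqrt-weight scaling.
        design = np.column_stack([np.ones_like(xs), xs])
        sw = np.sqrt(ws)
        beta, *_ = np.linalg.lstsq(design * sw[:, None], ys * sw, rcond=None)
        return beta[0]  # fitted value at x = 0, i.e. at the cutoff

    deferred = x >= 0
    return intercept_at_cutoff(deferred) - intercept_at_cutoff(~deferred)

# Noise-free toy data with a known jump of 0.3 at the cutoff.
score = np.linspace(0.0, 1.0, 201)
cutoff = 0.5
outcome = 0.5 + 0.1 * (score - cutoff) + 0.3 * (score - cutoff >= 0)
tau_rd_hat = local_linear_rd(score, outcome, cutoff, bandwidth=0.2)
```

The estimate is local: it describes instances whose score lies near the cutoff and says nothing about instances far from the deferral boundary.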

Paper Structure (52 sections, 9 theorems, 23 equations, 4 figures, 2 tables)

INTRODUCTION
BACKGROUND
Causal Inference
Deferring Systems
Related Work
Policy Evaluation.
Deferring Systems Applications.
EVALUATING DEFERRING SYSTEMS
Methodology.
Scenario 1: deferring systems as an almost perfect causal inference design
Scenario 2: deferring systems as an RD design
EXPERIMENTAL EVALUATION
Experimental settings
Data.
Baselines.
...and 37 more sections

Key Result

Theorem 1

Let the continuity assumption hold. Then:

Figures (4)

Figure 1: In blue, the $({c})\%$ of instances assigned to the ML model; in orange, the $(1-{c})\%$ of instances assigned to the human.
Figure 2: Scenario 1 assumptions: thick (dashed) lines are observed (unobserved) values. The coloured area represents where the effects can be estimated.
Figure 3: Scenario 2 assumptions: thick (dashed) lines are observed (unobserved) values. We can estimate $\tau_{\mathtt{RD}}$ at the cutoff value.
Figure 4: Experimental results. The panels report: system accuracy and $\hat{\tau}_{\mathtt{ATD}}$ (Scenario 1); estimated $\hat{\tau}_{\mathtt{CATD}}$ when conditioning on the gender of the patient on the xray-airspace dataset; a comparison of $\hat{\tau}_{\mathtt{ATD}}$ over multiple baselines on synth; and $\hat{\tau}_{\mathtt{RD}}$ (Scenario 2).

Theorems & Definitions (13)

Theorem 1: Theorem 3 from Hahn, Todd, and Van der Klaauw (2001)
Proposition 1
Proposition 2
Proposition 3
Proposition 4
Proposition 1
proof
Proposition 2
proof
Proposition 3
...and 3 more
