Does calibration mean what they say it means; or, the reference class problem rises again

Lily Hu

TL;DR

The paper challenges the normative claim that calibration within groups secures fair treatment. Invoking the reference class problem, it argues that because individuals belong to many groups at once, calibration within any single group cannot guarantee a stable interpretation of an individual's score. The Same Meaning picture therefore presupposes a solution to the reference class problem, one that is both unargued and likely incorrect, and increasing the granularity of groups via multicalibration does not fully resolve the issue. Using Eva's base rate tracking example, the author illustrates the broader methodological pitfalls of relying on stylized toy cases to justify group-based fairness criteria. The piece advocates rethinking fairness metrics beyond single-group calibration, with attention to intersectionality and to the limits of using group averages to adjudicate individual treatment.

Abstract

Discussions of statistical criteria for fairness commonly convey the normative significance of calibration within groups by invoking what risk scores "mean." On the Same Meaning picture, group-calibrated scores "mean the same thing" (on average) across individuals from different groups and accordingly, guard against disparate treatment of individuals based on group membership. My contention is that calibration guarantees no such thing. Since concrete actual people belong to many groups, calibration cannot ensure the kind of consistent score interpretation that the Same Meaning picture implies matters for fairness, unless calibration is met within every group to which an individual belongs. Alas only perfect predictors may meet this bar. The Same Meaning picture thus commits a reference class fallacy by inferring from calibration within some group to the "meaning" or evidential value of an individual's score, because they are a member of that group. The reference class answer it presumes does not only lack justification; it is very likely wrong. I then show that the reference class problem besets not just calibration but other group statistical criteria that claim a close connection to fairness. Reflecting on the origins of this oversight opens a wider lens onto the predominant methodology in algorithmic fairness based on stylized cases.
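
The central observation, that calibration within one grouping says nothing about calibration within another grouping of the same people, can be illustrated with a small sketch. Everything in it is an illustrative assumption rather than material from the paper: the eight records, the attribute names `first` and `second`, the group labels, and the helper functions are all hypothetical. Calibration within groups is read here in the usual way, namely that among members of a group who receive score s, a fraction s have the positive outcome.

```python
from collections import defaultdict

# Hypothetical toy population: every person gets the same score, 0.5.
# "first" and "second" are two cross-cutting ways of grouping the same people.
people = [
    {"first": "A", "second": "X", "score": 0.5, "outcome": 1},
    {"first": "A", "second": "X", "score": 0.5, "outcome": 1},
    {"first": "A", "second": "Y", "score": 0.5, "outcome": 0},
    {"first": "A", "second": "Y", "score": 0.5, "outcome": 0},
    {"first": "B", "second": "X", "score": 0.5, "outcome": 1},
    {"first": "B", "second": "X", "score": 0.5, "outcome": 1},
    {"first": "B", "second": "Y", "score": 0.5, "outcome": 0},
    {"first": "B", "second": "Y", "score": 0.5, "outcome": 0},
]


def calibration_table(records, group_key):
    """Map each (group, score) cell to the observed rate of positive outcomes."""
    counts = defaultdict(lambda: [0, 0])  # (group, score) -> [positives, total]
    for r in records:
        cell = counts[(r[group_key], r["score"])]
        cell[0] += r["outcome"]
        cell[1] += 1
    return {key: pos / total for key, (pos, total) in counts.items()}


def is_calibrated_within(records, group_key, tol=1e-9):
    """True if, in every group of this grouping, the observed rate equals the score."""
    table = calibration_table(records, group_key)
    return all(abs(rate - score) <= tol for (_, score), rate in table.items())


print(is_calibrated_within(people, "first"))   # grouping by A / B
print(is_calibrated_within(people, "second"))  # grouping by X / Y
print(calibration_table(people, "second"))
```

In this toy data the score 0.5 matches the observed rate within A and within B, yet everyone in X has the positive outcome and no one in Y does. A person who is in both A and X has a score that is "right" relative to one of their groups and badly off relative to the other, which is the shape of the reference class worry the abstract describes.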

Paper Structure

This paper contains 11 sections, 2 equations, 1 figure.

Figures (1)

  • Figure 1: Eva's table from "Algorithmic fairness and base rate tracking," 249.

Theorems & Definitions (2)

  • Definition 1: Calibration
  • Definition 2: Calibration Within Groups