Predictive Inequity in Object Detection

Benjamin Wilson, Judy Hoffman, Jamie Morgenstern

TL;DR

The paper examines predictive inequity in pedestrian detection between two Fitzpatrick skin-tone groups, LS (types 1-3, lighter skin) and DS (types 4-6, darker skin), in driving datasets. It reports consistently lower detection performance on DS pedestrians across multiple models and training regimes. The authors establish a benchmark by annotating BDD100K with skin tones, define a loss-based inequity metric, and analyze potential sources of the gap: occlusion, time of day, and loss prioritization. Detection of LS pedestrians generally achieves higher AP, especially AP75, across architectures such as Faster R-CNN and Mask R-CNN, and simple loss reweighting partially mitigates the gap. The work underscores the importance of fairness in safety-critical vision systems and suggests that dataset and training adjustments can reduce, but not fully eliminate, predictive inequity, which motivates broader strategies for equitable autonomous driving perception.

Abstract

In this work, we investigate whether state-of-the-art object detection systems have equitable predictive performance on pedestrians with different skin tones. This work is motivated by many recent examples of ML and vision systems displaying higher error rates for certain demographic groups than others. We annotate an existing large-scale dataset that contains pedestrians, BDD100K, with Fitzpatrick skin tones in ranges [1-3] or [4-6]. We then provide an in-depth comparative analysis of performance between these two skin tone groupings, finding that neither time of day nor occlusion explains this behavior, suggesting this disparity is not merely the result of pedestrians in the 4-6 range appearing in more difficult scenes for detection. We investigate to what extent time of day, occlusion, and reweighting the supervised loss during training affect this predictive bias.

Paper Structure

This paper contains 25 sections, 11 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Instructions given to Mechanical Turk annotators for classifying LS and DS people.
  • Figure 2: Annotation interface.
  • Figure 3: Histogram of the annotator responses. Each of the three annotators was given a choice of labeling as Category A (LS -- denoted as L), Category B (DS -- denoted as D), unknown (U), or not a person (N). Only instances with a consensus vote for LS or DS were labeled as such.
  • Figure 4: AP performance gap comparing LS and DS individuals using an unweighted model across training iterations on BDD100K. LS people consistently have higher AP than DS people.
  • Figure 5: Example detections from Faster R-CNN using the R-50-FPN backbone, trained on BDD100K. For reference, the ground truth annotations for LS and DS are pink and purple respectively. Yellow boxes correspond to true positives under the AP$_{50}$ metric and false positives under the AP$_{75}$ metric. Green boxes correspond to true positives under the AP$_{75}$ metric. All the predictions shown are greater than an 85% confidence threshold.
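
The colour coding in the Figure 5 caption follows from the intersection-over-union (IoU) thresholds behind AP$_{50}$ and AP$_{75}$: under the usual COCO-style convention, a prediction counts as a true positive at AP$_{50}$ when its IoU with a ground-truth box is at least 0.50, and at AP$_{75}$ when it is at least 0.75. The Python sketch below illustrates that rule only. The function names, the `(x1, y1, x2, y2)` box format, and the example boxes are assumptions made for illustration, not the authors' code, and the sketch leaves out the one-to-one matching and confidence ranking that a full AP computation performs.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def figure5_colour(pred_box, truth_box):
    """Hypothetical helper: colour a prediction as in the Figure 5 caption.

    green  -> true positive under AP75 (IoU >= 0.75)
    yellow -> true positive under AP50 but false positive under AP75
    None   -> false positive under both thresholds
    """
    overlap = iou(pred_box, truth_box)
    if overlap >= 0.75:
        return "green"
    if overlap >= 0.50:
        return "yellow"
    return None


# Made-up boxes: a 100x100 ground truth and two predictions of differing tightness.
truth = (0, 0, 100, 100)
tight_pred = (0, 0, 100, 80)   # IoU = 0.8
loose_pred = (0, 0, 100, 60)   # IoU = 0.6
print(figure5_colour(tight_pred, truth), figure5_colour(loose_pred, truth))
```

A loosely localized box can therefore count as a hit at AP$_{50}$ and a miss at AP$_{75}$, so AP$_{75}$ is the stricter test of localization quality.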

Theorems & Definitions (1)

  • Remark 1