Explaining CLIP's performance disparities on data from blind/low vision users

Daniela Massiceti, Camilla Longden, Agnieszka Słowik, Samuel Wills, Martin Grayson, Cecily Morrison

TL;DR

Few-shot learning with as few as 5 images can mitigate CLIP's quality-of-service disparities for BLV users in some scenarios; this is discussed alongside a set of other possible mitigations.

Abstract

Large multi-modal models (LMMs) hold the potential to usher in a new era of automated visual assistance for people who are blind or low vision (BLV). Yet, these models have not been systematically evaluated on data captured by BLV users. We address this by empirically assessing CLIP, a widely-used LMM likely to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task, we find that their accuracy is 15 percentage points lower on average for images captured by BLV users than web-crawled images. This disparity stems from CLIP's sensitivities to 1) image content (e.g. not recognizing disability objects as well as other objects); 2) image quality (e.g. not being robust to lighting variation); and 3) text content (e.g. not recognizing objects described by tactile adjectives as well as visual ones). We delve deeper with a textual analysis of three common pre-training datasets: LAION-400M, LAION-2B and DataComp-1B, showing that disability content is rarely mentioned. We then provide three examples that illustrate how the performance disparities extend to three downstream models underpinned by CLIP: OWL-ViT, CLIPSeg and DALL-E2. We find that few-shot learning with as few as 5 images can mitigate CLIP's quality-of-service disparities for BLV users in some scenarios, which we discuss alongside a set of other possible mitigations.
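
As a reading aid, the zero-shot classification task mentioned above can be sketched as picking the label whose text embedding is most similar to the image embedding. The snippet below is a minimal illustration under stated assumptions, not the paper's evaluation code: the function names, the three labels, and the 3-dimensional vectors are hypothetical stand-ins for the embeddings a CLIP image and text encoder would produce.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, text_embs, labels):
    """Return the label whose text embedding is closest to the image embedding.

    image_emb: embedding of one image (hypothetical stand-in for a CLIP output).
    text_embs: one embedding per candidate label prompt.
    labels:    label names, in the same order as text_embs.
    """
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(t))) for t in text_embs]
    best = max(range(len(labels)), key=lambda i: sims[i])
    return labels[best], sims


# Toy vectors chosen for illustration only; they are not real CLIP embeddings.
labels = ["guide cane", "remote control", "mug"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.2, 0.9, 0.1]

pred, sims = zero_shot_classify(image_emb, text_embs, labels)
print(pred, [round(s, 3) for s in sims])
```

No image from the target classes is used for training here: the label set is supplied only as text at inference time, which is what makes the setting zero-shot.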

Paper Structure (53 sections, 2 equations, 16 figures, 17 tables)


Figures (16)

  • Figure 1: CLIP's zero-shot object recognition accuracy is 15 percentage points lower on images from BLV users (ORBIT, VizWiz-Classification) than on web-crawled images (MSCOCO, Open Images). Average accuracy (with 95% confidence intervals) in a standardized zero-shot image classification task is reported over 80-100K images per dataset for 25 CLIP variants.
  • Figure 2: Examples from the ORBIT Dataset. (top) Disability objects: guide canes, liquid level sensor, electronic Braille device. (middle) Quality issues typical in images: underexposure, blur, camera viewpoint, and framing. (bottom) A remote control and a Victor Reader Stream, each in a clean and a cluttered frame.
  • Figure 3: Blur, viewpoint/rotation, occlusion and lighting issues all have large negative marginal effects on model accuracy, with high statistical significance, but these are not compounded for exclusive disability objects. Each dot represents a CLIP variant, with its color showing the significance level.
  • Figure 4: OWL-ViT (Minderer et al., 2022) detects disability objects less consistently than non-disability objects. Disability objects are often mistaken for other objects, sometimes with higher confidence.
  • Figure 5: DALL-E2 (Ramesh et al., 2022) defaults either to common objects or to fabrications when prompted with disability objects like guide canes and electronic Braille devices. In contrast, it generates high-quality images of non-disability objects (see the corresponding appendix figure).
  • ...and 11 more figures
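
The Figure 1 caption reports average accuracy with 95% confidence intervals across CLIP variants. One way such a summary could be computed is sketched below. This is an assumption-laden illustration: the normal-approximation interval, the function names, and the per-variant accuracy values are all hypothetical and are not taken from the paper.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Mean and normal-approximation 95% confidence interval over model variants."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # standard error
    return mean, mean - z * sem, mean + z * sem


def gap_in_points(acc_a, acc_b):
    """Difference between mean accuracies, expressed in percentage points."""
    return 100.0 * (statistics.fmean(acc_a) - statistics.fmean(acc_b))


# Hypothetical per-variant accuracies, for illustration only.
web_acc = [0.60, 0.64, 0.58, 0.62]
blv_acc = [0.48, 0.52, 0.47, 0.49]

web_mean, web_lo, web_hi = mean_with_ci(web_acc)
blv_mean, blv_lo, blv_hi = mean_with_ci(blv_acc)
gap = gap_in_points(web_acc, blv_acc)

print(f"web: {web_mean:.3f} [{web_lo:.3f}, {web_hi:.3f}]")
print(f"BLV: {blv_mean:.3f} [{blv_lo:.3f}, {blv_hi:.3f}]")
print(f"gap: {gap:.1f} percentage points")
```

A gap in percentage points is the plain difference of the two accuracies, which differs from a relative (percent) change; the caption's "15 percentage points" is of the former kind.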