Benchmarking Spurious Bias in Few-Shot Image Classifiers

Guangtao Zheng, Wenqian Ye, Aidong Zhang

TL;DR

FewSTAB is a systematic and rigorous benchmark framework that fairly demonstrates and quantifies the varied degrees of robustness of few-shot classifiers to spurious bias. It can benchmark spurious bias in varied degrees and enable designs for varied degrees of robustness.

Abstract

Few-shot image classifiers are designed to recognize and classify new data with minimal supervision and limited data, but they often rely on spurious correlations between classes and spurious attributes, a tendency known as spurious bias. Spurious correlations commonly hold in certain samples, and few-shot classifiers can suffer from the spurious bias these correlations induce. No automatic benchmarking system currently exists to assess the robustness of few-shot classifiers against spurious bias. In this paper, we propose a systematic and rigorous benchmark framework, termed FewSTAB, to fairly demonstrate and quantify varied degrees of robustness of few-shot classifiers to spurious bias. FewSTAB creates few-shot evaluation tasks with biased attributes, so that a classifier relying on those attributes for its predictions performs poorly. To construct these tasks, we propose attribute-based sample selection strategies based on a pre-trained vision-language model, eliminating the need for manual dataset curation. This allows FewSTAB to automatically benchmark spurious bias using any existing test data. FewSTAB offers evaluation results in a new dimension along with a new design guideline for building robust classifiers. Moreover, it can benchmark spurious bias in varied degrees and enable designs for varied degrees of robustness. Its effectiveness is demonstrated through experiments on ten few-shot learning methods across three datasets. We hope our framework can inspire new designs of robust few-shot classifiers. Our code is available at https://github.com/gtzheng/FewSTAB.
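The idea of building an evaluation task from biased attributes can be illustrated with a minimal sketch. It is not the FewSTAB implementation; the function name `select_task`, the `(sample_id, label, attributes)` input format, and the specific rule used here (support samples share one attribute, query samples of the same class lack it) are assumptions made for illustration. In practice the attribute sets would come from a pre-trained vision-language model, which this sketch does not call.

```python
import random
from collections import defaultdict


def select_task(samples, n_way, k_shot, n_query, seed=0):
    """Build one few-shot task from (sample_id, label, attributes) triples.

    Hypothetical rule: for each class, pick an attribute, draw support
    samples that HAVE it and query samples that LACK it, so a classifier
    leaning on that attribute is likely to fail on the queries.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample_id, label, attrs in samples:
        by_class[label].append((sample_id, frozenset(attrs)))

    classes = rng.sample(sorted(by_class), n_way)
    task = {}
    for label in classes:
        pool = by_class[label]
        # Candidate attributes are all attributes seen in this class.
        attributes = sorted(set().union(*(a for _, a in pool)))
        rng.shuffle(attributes)
        chosen = None
        for attr in attributes:
            with_attr = [i for i, a in pool if attr in a]
            without_attr = [i for i, a in pool if attr not in a]
            if len(with_attr) >= k_shot and len(without_attr) >= n_query:
                chosen = (attr, with_attr, without_attr)
                break
        if chosen is None:
            raise ValueError(f"no usable attribute for class {label!r}")
        attr, with_attr, without_attr = chosen
        task[label] = {
            "attribute": attr,
            "support": rng.sample(with_attr, k_shot),
            "query": rng.sample(without_attr, n_query),
        }
    return task


# Toy data: attribute sets are made up for this example.
samples = [
    ("b1", "bird", {"tree branch"}), ("b2", "bird", {"tree branch"}),
    ("b3", "bird", {"grass"}), ("b4", "bird", {"grass"}),
    ("d1", "dog", {"sofa"}), ("d2", "dog", {"sofa"}),
    ("d3", "dog", {"grass"}), ("d4", "dog", {"grass"}),
]
task = select_task(samples, n_way=2, k_shot=1, n_query=1)
```

Each entry of `task` records the attribute chosen for that class together with the support and query sample IDs, so the bias built into the task can be inspected directly.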

Paper Structure (24 sections, 9 equations, 9 figures, 17 tables)


Figures (9)

  • Figure 1: Exploiting the spurious correlation between the class bird and the spurious attribute tree branch to predict bird leads to an incorrect prediction on the test image showing birds on a grass field. For clarity, we only show the case for one class.
  • Figure 2: FewSTAB overview. (a) Extract distinct attributes using a pre-trained VLM. (b) Generate an FSC task for the evaluation of spurious bias in few-shot classifiers.
  • Figure 3: A 5-way 1-shot task constructed by our inter-class attribute-based sample selection using samples from the miniImageNet dataset. Note that due to the limited capacity of a VLM, the attributes may not align well with human understanding.
  • Figure 4: Accuracy gaps (wAcc-R minus wAcc-A) on the 5-way 1-shot and 5-way 5-shot tasks from the (a) miniImageNet, (b) tieredImageNet, and (c) CUB-200 datasets.
  • Figure 5: Acc versus wAcc-A of the ten FSC methods tested on 5-way 5-shot tasks from miniImageNet.
  • ...and 4 more figures
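
The accuracy gap in the Figure 4 caption is a simple difference, sketched below. This excerpt does not define wAcc-R or wAcc-A; the sketch assumes wAcc is the worst per-class accuracy on a task, with the R and A suffixes naming two differently constructed sets of tasks. The function names and the toy predictions are hypothetical.

```python
from collections import defaultdict


def worst_class_accuracy(labels, predictions):
    """Return the lowest per-class accuracy (assumed meaning of wAcc)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    return min(correct[c] / total[c] for c in total)


def accuracy_gap(wacc_r, wacc_a):
    """Gap as written in the Figure 4 caption: wAcc-R minus wAcc-A."""
    return wacc_r - wacc_a


# Toy predictions, made up for this example.
labels = [0, 0, 1, 1]
wacc_r = worst_class_accuracy(labels, [0, 0, 1, 1])  # every class correct
wacc_a = worst_class_accuracy(labels, [0, 0, 1, 0])  # class 1 half correct
gap = accuracy_gap(wacc_r, wacc_a)
```

Under these assumptions, a larger gap means the classifier loses more worst-class accuracy when moving from the R tasks to the A tasks.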
