Noise or Signal: The Role of Image Backgrounds in Object Recognition

Kai Xiao, Logan Engstrom, Andrew Ilyas, Aleksander Madry

TL;DR

This work exposes the extent to which image backgrounds drive object recognition by constructing a foreground-background disentanglement toolkit and a family of IN-9 datasets (including the larger IN-9L). It demonstrates that backgrounds can carry substantial predictive signals, that models are vulnerable to adversarial backgrounds, and that training on mixed-background data reduces reliance on backgrounds while preserving accuracy on real-world data. The authors also analyze how progress on standard benchmarks relates to background dependence and discuss possible robustness strategies, such as distributionally robust optimization. Overall, the study provides a nuanced view of background cues as both a potential aid and a pitfall in modern vision systems, offering a concrete framework to measure and improve robustness to contextual signals.

Abstract

We assess the tendency of state-of-the-art object recognition models to depend on signals from image backgrounds. We create a toolkit for disentangling foreground and background signal on ImageNet images, and find that (a) models can achieve non-trivial accuracy by relying on the background alone, (b) models often misclassify images even in the presence of correctly classified foregrounds--up to 87.5% of the time with adversarially chosen backgrounds, and (c) more accurate models tend to depend on backgrounds less. Our analysis of backgrounds brings us closer to understanding which correlations machine learning models use, and how they determine models' out-of-distribution performance.

Paper Structure

This paper contains 26 sections, 30 figures, 5 tables.

Figures (30)

  • Figure 1: Variations of the synthetic dataset ImageNet-9, as described in Table \ref{table:8datasets}. We label each image with its pre-trained ResNet-50 classification---green, if corresponding with the original label; red, if not. The model correctly classifies the image as "insect" when given: the original image, only the background, and two cases where the original foreground is present but the background changes. Note that, in particular, the model fails in two cases when the original foreground is present but the background changes (as in Mixed-Next or Only-FG).
  • Figure 2: We train models on each of the "background-only" datasets, then evaluate each on its corresponding test set as well as the Original test set. Even though the model only learns from background signal, it achieves (much) better than random performance on both the corresponding test set and Original. Here, random guessing would give 11.11% (the dotted line).
  • Figure 3: The adversarial backgrounds that most frequently fool IN-9L-trained models into classifying a given foreground as insect, ordered by the percentage of foregrounds fooled. The total portion of images that can be fooled (by any background from this class) is 66.55%.
  • Figure 4: Histogram of insect backgrounds grouped by how often they cause (non-insect) foregrounds to be classified as insect by an IN-9L-trained model. We visualize the five backgrounds that fool the classifier on the largest percentage of images in Figure 3.
  • Figure 5: We compare the test performance of a model trained on the synthetic Mixed-Rand dataset with a model trained on Original. We evaluate these models on variants of IN-9 that contain identical foregrounds. For the Original-trained model, test performance decreases significantly when the background signal is modified during testing. However, the Mixed-Rand-trained model is robust to background changes, albeit at the cost of lower accuracy on images from Original.
  • ...and 25 more figures
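
The two quantities reported in the captions of Figures 3 and 4 (how often a single background fools the classifier, and what fraction of foregrounds can be fooled by any background from a class) can be illustrated with a minimal sketch. This is not the authors' toolkit: the function `fooling_stats`, the `predict` callable, and the way foregrounds and backgrounds are represented are all hypothetical, chosen only to make the bookkeeping concrete.

```python
from typing import Any, Callable, List, Sequence, Tuple


def fooling_stats(
    predict: Callable[[Any, Any], Any],  # hypothetical: (foreground, background) -> predicted label
    foregrounds: Sequence[Any],          # foregrounds whose true class is NOT target_class
    backgrounds: Sequence[Any],          # candidate backgrounds taken from target_class images
    target_class: Any,
) -> Tuple[List[float], float]:
    """Return (per-background fooling rates, fraction fooled by any background)."""
    if not foregrounds:
        raise ValueError("need at least one foreground")

    # fooled[i][j] is True when foreground i pasted on background j
    # is classified as target_class.
    fooled = [
        [predict(fg, bg) == target_class for bg in backgrounds]
        for fg in foregrounds
    ]
    n_fg = len(foregrounds)

    # Per-background rate: share of foregrounds that this one background fools.
    per_background = [
        sum(row[j] for row in fooled) / n_fg for j in range(len(backgrounds))
    ]
    # Any-background rate: share of foregrounds fooled by at least one background.
    any_background = sum(any(row) for row in fooled) / n_fg
    return per_background, any_background
```

Under these assumptions, sorting `per_background` in descending order would give a ranking like the one shown in Figure 3, a histogram of its values would correspond to Figure 4, and `any_background` plays the role of the total fooled fraction quoted in the Figure 3 caption.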