The Elephant in the Room

Amir Rosenfeld, Richard Zemel, John K. Tsotsos

TL;DR

The paper demonstrates that state-of-the-art object detectors exhibit unstable and non-local behavior when objects are transplanted between images. By generating thousands of test images through object transplanting and evaluating several detectors, it reveals not only localized misclassifications but also distant changes in detections and scene interpretation. The authors discuss potential root causes, including feature interference in ROI pooling, contextual reasoning, and non-maximum suppression dynamics, highlighting fundamental robustness gaps. The work provides a diagnostic framework and motivates architectural considerations to improve detector resilience against such perturbations. Overall, it exposes critical vulnerabilities in current detection pipelines and points toward directions for more robust contextual and feature integration.

Abstract

We showcase a family of common failures of state-of-the-art object detectors. These are obtained by replacing image sub-regions with another sub-image that contains a trained object. We call this "object transplanting". Modifying an image in this manner is shown to have a non-local impact on object detection. Slight changes in an object's position can affect its identity according to an object detector, as well as the identity of other objects in the image. We provide some analysis and suggest possible reasons for the reported phenomena.
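As a rough illustration of what "object transplanting" means operationally, the sketch below pastes a rectangular patch into a copy of an image and enumerates paste positions on a regular grid. It is a minimal sketch under stated assumptions, not the authors' generation code: the function names `transplant` and `sliding_transplants`, the `stride` parameter, and the use of plain nested lists in place of image arrays are all hypothetical choices made for illustration. Whether the paper pastes plain rectangles or uses masks or blending is not stated in the text above.

```python
import copy


def transplant(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted at row `top`, column `left`.

    `image` and `patch` are row-major nested lists of pixel values
    (an assumption for this sketch; real pipelines would use arrays).
    The patch must fit entirely inside the image.
    """
    patch_h, patch_w = len(patch), len(patch[0])
    img_h, img_w = len(image), len(image[0])
    if top < 0 or left < 0 or top + patch_h > img_h or left + patch_w > img_w:
        raise ValueError("patch does not fit inside the image at this position")

    out = copy.deepcopy(image)  # leave the original image untouched
    for r in range(patch_h):
        # Overwrite one row segment of the image with the patch row.
        out[top + r][left:left + patch_w] = list(patch[r])
    return out


def sliding_transplants(image, patch, stride):
    """Yield (top, left, modified_image) for every grid position where the patch fits."""
    patch_h, patch_w = len(patch), len(patch[0])
    img_h, img_w = len(image), len(image[0])
    for top in range(0, img_h - patch_h + 1, stride):
        for left in range(0, img_w - patch_w + 1, stride):
            yield top, left, transplant(image, patch, top, left)
```

Each modified image would then be passed to a detector, and its detections compared against those for the unmodified image to look for changes far from the pasted region.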

Paper Structure

This paper contains 1 section, 3 equations, 9 figures, 2 tables.

Table of Contents

  1. Test Image Generation

Figures (9)

  • Figure 1: Detecting an elephant in a room. A state-of-the-art object detector detects multiple objects in a living-room (a). A transplanted object (elephant) can remain undetected in many situations and arbitrary locations (b,d,e,g,i). It can assume incorrect identities such as a chair (f). The object has a non-local effect, causing other objects to disappear (cup: d,f; book: e-i) or switch identity (chair switches to couch in e). It is recommended to view this image in color online.
  • Figure 2: Effects of transplanting an object from an image into another location in the same image. Top row: original detection. Each subsequent row: newly detected objects w.r.t. the previous row, induced by the translated object copy.
  • Figure 3: Feature Interference. A partially visible cat is detected as a zebra (a). Discarding all pixels outside the detection's bounding box does not fix the object's classification, showing that features inside the region-of-interest (ROI) can cause confusion (b). Also discarding all non-cat pixels inside the ROI leads to a fixed classification (c). Adding random noise in the range outside the bounding box once again makes the detection incorrect, showing the effect of features outside the ROI (d).
  • Figure 4: Non-local effects of object transplant on Google's OCR. A keyboard placed in two different locations in an image causes a different interpretation of the text in the sign on the right. The output for the top image is "dog bi" and for the bottom it is "La Cop".
  • Figure 5: Detection with Transplanted Objects. Top row: original images. Left-to-right: detections of the models faster_rcnn_inception_resnet_v2_atrous_coco, faster_rcnn_nas_coco, ssd_mobilenet_v1_coco, mask_rcnn_inception_resnet_v2_atrous_coco, mask_rcnn_resnet101_atrous_coco. To avoid clutter, each row shows only the detections newly added w.r.t. the previous row in the same column. Transplanting the bear causes a variety of new objects to be detected, e.g.: chair, car, book (first column); kite, knife, cellphone (second column).
  • ...and 4 more figures