The Elephant in the Room
Amir Rosenfeld, Richard Zemel, John K. Tsotsos
TL;DR
The paper demonstrates that state-of-the-art object detectors behave in unstable and non-local ways when objects are transplanted between images. By generating thousands of test images through object transplanting and evaluating several detectors, it reveals not only misclassifications at the transplant site but also changes to detections of distant objects and to the overall scene interpretation. The authors discuss potential root causes, including feature interference in ROI pooling, contextual reasoning, and non-maximum suppression dynamics. The work provides a diagnostic framework for exposing these robustness gaps and motivates architectural changes toward more robust integration of context and features.
Abstract
We showcase a family of common failures of state-of-the-art object detectors. These are obtained by replacing an image sub-region with another sub-image that contains a trained object. We call this "object transplanting". Modifying an image in this manner is shown to have a non-local impact on object detection. Slight changes in the position of the transplanted object can change the identity a detector assigns to it, as well as the identities assigned to other objects in the image. We provide some analysis and suggest possible reasons for the reported phenomena.
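The transplanting operation described above can be illustrated with a minimal sketch. It is not the authors' code: the function names `transplant` and `sliding_transplants`, the rectangular paste without blending, and the `stride` parameter are all assumptions made for illustration. It pastes a sub-image into a scene at a given position and enumerates variants at shifted positions, which is the kind of test set one could feed to a detector to check whether detections change far from the pasted region.

```python
import numpy as np


def transplant(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted at (top, left).

    Hypothetical helper: a plain rectangular paste with no blending.
    """
    h, w = patch.shape[:2]
    H, W = image.shape[:2]
    if top < 0 or left < 0 or top + h > H or left + w > W:
        raise ValueError("patch does not fit inside the image at this position")
    out = image.copy()  # leave the original scene untouched
    out[top:top + h, left:left + w] = patch
    return out


def sliding_transplants(image, patch, stride):
    """Yield ((top, left), modified_image) for each grid position.

    `stride` is an assumed parameter controlling how far the
    transplanted object moves between consecutive variants.
    """
    h, w = patch.shape[:2]
    H, W = image.shape[:2]
    for top in range(0, H - h + 1, stride):
        for left in range(0, W - w + 1, stride):
            yield (top, left), transplant(image, patch, top, left)


# Toy data standing in for a real scene and a cropped object.
scene = np.zeros((6, 8, 3), dtype=np.uint8)
obj = np.full((2, 3, 3), 255, dtype=np.uint8)
variants = list(sliding_transplants(scene, obj, stride=2))
```

Each element of `variants` pairs a position with a modified copy of the scene. Running a detector on every copy and comparing its output with the output on the unmodified `scene` would show which detections depend on the transplanted object's position.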
