Improving Open-World Object Localization by Discovering Background
Ashish Singh, Michael J. Jones, Kuan-Chuan Peng, Anoop Cherian, Moitreya Chatterjee, Erik Learned-Miller
TL;DR
Open-world object localization (OWOL) requires localizing all objects in an image, including those from categories unseen during training. The authors introduce Background-aware Open-World Localization (BOWL), which learns a more robust objectness signal by discovering non-object regions: an exemplar-based background codebook is built from self-supervised patch features (DINO ViT), and cosine similarity to the codebook identifies negative anchors during training. This non-object supervision is integrated into a Faster R-CNN-style detector and improves AR@100 on unseen classes across cross-category and cross-dataset benchmarks, with AR$_{N@100}$ gains of several percentage points. The higher recall on novel objects and the generalization across domain shifts suggest practical value for open-world perception in autonomous systems and robotics.
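To make the codebook idea concrete, the following is a minimal sketch of how patch features might be compared against background exemplars with cosine similarity. It is not the authors' implementation: the function names, the threshold of 0.8, and the toy features (standing in for DINO ViT patch embeddings) are all assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def find_background_patches(patch_feats, codebook, threshold=0.8):
    """Flag patches that closely match any background exemplar.

    patch_feats: (P, D) array of patch features (hypothetical stand-in for
                 self-supervised ViT patch embeddings).
    codebook:    (K, D) array of background exemplar features.
    threshold:   assumed cosine-similarity cutoff, not taken from the paper.
    Returns a boolean mask of shape (P,) and the max similarity per patch.
    """
    sims = l2_normalize(patch_feats) @ l2_normalize(codebook).T  # (P, K)
    max_sim = sims.max(axis=1)
    return max_sim >= threshold, max_sim

# Toy data: 4 exemplars along the first 4 axes of a 16-d feature space.
rng = np.random.default_rng(0)
codebook = np.eye(4, 16)
# Two patches that are slightly perturbed copies of exemplars 0 and 2.
background_like = codebook[[0, 2]] + 0.01 * rng.normal(size=(2, 16))
# Three patches orthogonal to every exemplar (cosine similarity 0).
object_like = np.eye(16)[[8, 9, 10]]
patch_feats = np.vstack([background_like, object_like])

is_background, max_sim = find_background_patches(patch_feats, codebook)
print(is_background)
```

In this toy setup only the first two patches are flagged as background. In a detector, patches flagged this way would be candidates for non-object supervision.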
Abstract
Our work addresses the problem of learning to localize objects in an open-world setting, i.e., given bounding box information for a limited number of object classes during training, the goal is to localize all objects in an image during inference, whether they belong to the training classes or to unseen classes. Recent work in this area has focused on improving the characterization of objects, either explicitly by proposing new objective functions (localization quality) or implicitly by using object-centric auxiliary information, such as depth or pixel/region affinity maps. In this work, we address this problem by incorporating background information to guide the learning of the notion of objectness. Specifically, we propose a novel framework to discover background regions in an image and train an object proposal network not to detect any objects in these regions. We formulate the background discovery task as that of identifying image regions that are not discriminative, i.e., those that are redundant and carry little information. We conduct experiments on standard benchmarks to showcase the effectiveness of our proposed approach and observe significant improvements over the previous state-of-the-art approaches for this task.
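As a rough illustration of training a proposal network "not to detect any objects" in discovered background regions, the sketch below labels anchor boxes as negatives when most of their area falls on background patches. The coverage rule, the `min_fraction` value, and all function names are assumptions for illustration; the abstract does not specify how anchors are assigned.

```python
import numpy as np

def background_fraction(boxes, bg_mask, patch_size):
    """Fraction of each box's area covered by background patches.

    boxes:      (N, 4) integer pixel boxes as (x1, y1, x2, y2), assumed to
                lie inside the image.
    bg_mask:    (H_p, W_p) boolean grid, True where a patch is background.
    patch_size: side length of one patch in pixels.
    """
    # Expand the patch grid to pixel resolution.
    pixel_mask = np.kron(bg_mask.astype(np.float64),
                         np.ones((patch_size, patch_size)))
    # Zero-padded integral image: integral[y, x] = sum of pixel_mask[:y, :x].
    integral = np.pad(pixel_mask.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    x1, y1, x2, y2 = boxes.astype(int).T
    area = (x2 - x1) * (y2 - y1)
    bg_area = (integral[y2, x2] - integral[y1, x2]
               - integral[y2, x1] + integral[y1, x1])
    return bg_area / np.maximum(area, 1)

def label_negative_anchors(boxes, bg_mask, patch_size, min_fraction=0.9):
    """Mark anchors as negatives when background covers at least min_fraction
    of their area (an assumed rule, not the paper's)."""
    return background_fraction(boxes, bg_mask, patch_size) >= min_fraction

# Toy example: a 32x32 image with 8x8 patches; the left half is background.
bg_mask = np.zeros((4, 4), dtype=bool)
bg_mask[:, :2] = True
anchors = np.array([
    [0, 0, 16, 32],   # entirely on background
    [16, 0, 32, 32],  # entirely off background
    [8, 0, 24, 16],   # straddles the boundary
])
fractions = background_fraction(anchors, bg_mask, patch_size=8)
negatives = label_negative_anchors(anchors, bg_mask, patch_size=8)
print(fractions, negatives)
```

Here only the first anchor is treated as a negative. Anchors flagged this way could be added to the negative set of a proposal network's objectness loss, while anchors that only partly overlap background are left unlabeled.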
