Improving Open-World Object Localization by Discovering Background

Ashish Singh, Michael J. Jones, Kuan-Chuan Peng, Anoop Cherian, Moitreya Chatterjee, Erik Learned-Miller

TL;DR

Open-world object localization (OWOL) requires localizing all objects, including unseen categories, in a single image. The authors introduce Background-aware Open-World Localization (BOWL), which learns a robust objectness signal by discovering non-object regions through an exemplar-based background codebook built from self-supervised patch features (DINO ViT) and applying cosine similarity to identify negative anchors during training. This non-object supervision is integrated into a Faster-RCNN–style detector, yielding improvements in AR@100 for unseen classes across cross-category and cross-dataset benchmarks, e.g., AR$_{N@100}$ gains of several percentage points. The approach enhances recall for novel objects and generalizes across domain shifts, suggesting practical impact for open-world perception in autonomous systems and robotics.
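
The matching step mentioned above (cosine similarity against a background codebook) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes patch embeddings (e.g., from DINO) have already been computed and are given as plain lists of floats, and the function names and the threshold value `0.9` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_background_patches(patch_embeddings, codebook, threshold=0.9):
    """Return indices of patches whose best codebook match exceeds `threshold`.

    `codebook` is a list of background exemplar embeddings; `threshold` is an
    assumed value chosen for illustration only.
    """
    background = []
    for i, emb in enumerate(patch_embeddings):
        # Best similarity to any exemplar; -1.0 if the codebook is empty.
        best = max((cosine_similarity(emb, ex) for ex in codebook), default=-1.0)
        if best > threshold:
            background.append(i)
    return background


# Toy 2-D embeddings: only the first patch points in the exemplar's direction.
toy_codebook = [[1.0, 0.0]]
toy_patches = [[2.0, 0.1], [0.0, 1.0], [1.0, 1.0]]
background_idx = find_background_patches(toy_patches, toy_codebook, threshold=0.9)
```

In a detector, regions flagged this way would serve as negative (non-object) training examples; how flagged patches are mapped onto anchor boxes is not specified here and is left out of the sketch.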

Abstract

Our work addresses the problem of learning to localize objects in an open-world setting, i.e., given the bounding box information of a limited number of object classes during training, the goal is to localize all objects in an image, belonging to both the training and unseen classes, during inference. Towards this end, recent work in this area has focused on improving the characterization of objects either explicitly, by proposing new objective functions (localization quality), or implicitly, by using object-centric auxiliary information such as depth information, pixel/region affinity maps, etc. In this work, we address this problem by incorporating background information to guide the learning of the notion of objectness. Specifically, we propose a novel framework to discover background regions in an image and train an object proposal network to not detect any objects in these regions. We formulate the background discovery task as that of identifying image regions that are not discriminative, i.e., those that are redundant and carry low information content. We conduct experiments on standard benchmarks to showcase the effectiveness of our proposed approach and observe significant improvements over the previous state-of-the-art approaches for this task.

Paper Structure

This paper contains 21 sections, 10 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: An example of a training image showing a ground truth annotated object (green box), unlabeled objects that are not in the known training classes (purple boxes), and clear non-object/background regions (red boxes) that are automatically classified as non-object regions by BOWL (ours) and used in training.
  • Figure 2: Exemplar selection works by first splitting the $(t+1)^{\text{st}}$ image into fixed-size patches and computing a feature embedding (using DINO, represented by the function $f$) for each patch (represented by the grid of light blue cubes). Each patch is compared to the existing exemplars from the first $t$ images, $E(t)$, using the similarity function $S$. $E(0)$ is initialized to the empty set. Patch embeddings whose similarity to every exemplar in $E(t)$ is below a threshold (represented by dark blue cubes) are then added to the exemplar set to yield $E(t+1)$. This repeats until all training images are processed. In addition, each exemplar in $E(t)$ maintains a count of the number of times it was the most similar exemplar to a training patch embedding, which keeps track of each exemplar's cluster size.
  • Figure 3: Overview of the training process with BOWL. First, the input image on the left is passed through the feature pyramid network (FPN) backbone. Next, anchor boxes are generated, which cover the image with regions of different positions, aspect ratios, and scales. The anchor boxes are compared with the ground truth known object boxes to generate a set of labeled object boxes (shown in green). With BOWL, a DINO embedding for each image patch within an anchor box is also compared to the non-object exemplar set using cosine similarity, $S$, and embeddings whose similarity to an exemplar is above a certain threshold are used as non-object regions for training.
  • Figure 4: Qualitative results of (a) GGN wang2019region, (b) OLN kim2021oln, and (c) BOWL on MS-COCO validation images. All results shown are predictions with an objectness score higher than $0.75$, generated by models trained on VOC categories. While both OLN and GGN are able to localize unseen novel objects, they produce significant false positive and false negative predictions. Specifically, because of noisy pseudo-annotations, GGN incorrectly predicts non-object regions as objects with high objectness scores (false predictions on the floor and bed in the first row and on the wall in the second row). OLN, on the other hand, is unable to predict objects with shapes and scales not present in the training data because it uses only object supervision (false negative predictions for the bed in the first row and the laptop in the second row). BOWL mitigates these issues by utilizing non-object supervision, leading to better localization of objects. We provide more results in the supplementary material.
  • Figure 5: Base class evaluation on ADE20K
  • ...and 5 more figures
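
The exemplar selection procedure described in the Figure 2 caption can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' code: patch embeddings are assumed to be precomputed (standing in for the DINO function $f$) and passed as lists of floats, cosine similarity stands in for $S$, and the names `build_exemplar_set`, `process_image`, and the threshold `0.8` are hypothetical. Two details the caption leaves open are also assumed: a new exemplar starts with a count of 1, and an existing exemplar's count is incremented only when a patch is matched to it rather than added as a new exemplar.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def process_image(exemplars, counts, patch_embeddings, threshold):
    """Update E(t) -> E(t+1) in place using the patches of one image."""
    n_existing = len(exemplars)  # patches are compared against E(t) only
    for emb in patch_embeddings:
        best_idx, best_sim = None, float("-inf")
        for j in range(n_existing):
            sim = cosine_similarity(emb, exemplars[j])
            if sim > best_sim:
                best_idx, best_sim = j, sim
        if best_idx is None or best_sim < threshold:
            # Below-threshold similarity to every exemplar: add as a new exemplar.
            exemplars.append(list(emb))
            counts.append(1)  # assumed initial count
        else:
            # Otherwise credit the most similar exemplar (its cluster grows).
            counts[best_idx] += 1


def build_exemplar_set(images, threshold=0.8):
    """`images` is a list of images, each a list of patch embeddings."""
    exemplars, counts = [], []  # E(0) is the empty set
    for patch_embeddings in images:
        process_image(exemplars, counts, patch_embeddings, threshold)
    return exemplars, counts


# Toy example with 2-D embeddings for two images of two patches each.
toy_images = [
    [[1.0, 0.0], [0.0, 1.0]],     # image 1: E(0) is empty, so both patches are added
    [[1.0, 0.05], [-1.0, 0.0]],   # image 2: one near-duplicate, one novel patch
]
toy_exemplars, toy_counts = build_exemplar_set(toy_images, threshold=0.8)
```

Because each patch is compared only against the exemplars from earlier images, near-identical patches within the same image are all added in this sketch; the counts then record how often each exemplar was the nearest match for later patches.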