LP-OVOD: Open-Vocabulary Object Detection by Linear Probing

Chau Pham, Truong Vu, Khoi Nguyen

TL;DR

This work tackles open-vocabulary object detection by learning a linear probe for novel classes on top of Faster R-CNN features pretrained on base classes, supervised by pseudo-labels taken from the top-ranked proposals for each novel class text. It replaces the standard softmax classifier with a sigmoid head so that each class is scored independently, and adds a distillation head that aligns region features with CLIP representations; at inference, the base and novel classifiers are concatenated into a single unified classifier. The approach achieves state-of-the-art results on COCO, reaching $\text{AP}_{novel}=40.5$ with a ResNet50 backbone and no external data, and the paper reports ablation studies and transfer to LVIS, Objects365, and PASCAL VOC. Because only a linear layer is trained for new classes, the method suits rapid adaptation in deployments where novel labels arrive on the fly, extending detection to unseen categories while preserving base-class performance.

Abstract

This paper addresses the challenging problem of open-vocabulary object detection (OVOD) where an object detector must identify both seen and unseen classes in test images without labeled examples of the unseen classes in training. A typical approach for OVOD is to use joint text-image embeddings of CLIP to assign box proposals to their closest text label. However, this method has a critical issue: many low-quality boxes, such as over- and under-covered-object boxes, have the same similarity score as high-quality boxes since CLIP is not trained on exact object location information. To address this issue, we propose a novel method, LP-OVOD, that discards low-quality boxes by training a sigmoid linear classifier on pseudo labels retrieved from the top relevant region proposals to the novel text. Experimental results on COCO affirm the superior performance of our approach over the state of the art, achieving $\textbf{40.5}$ in $\text{AP}_{novel}$ using ResNet50 as the backbone and without external datasets or knowing novel classes during training. Our code will be available at https://github.com/VinAIResearch/LP-OVOD.
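
The retrieval and linear-probing idea described in the abstract can be illustrated with a minimal NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the function names (`retrieve_top_k`, `train_sigmoid_probe`), the synthetic embeddings, the learning rate, and the choice to reuse one array both as retrieval embeddings and as classifier features are all hypothetical simplifications.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale vectors to unit length along the last axis.
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def retrieve_top_k(proposal_emb, text_emb, k):
    # Rank proposals by cosine similarity to one text embedding
    # and return the indices of the k most similar ones.
    sims = l2_normalize(proposal_emb) @ l2_normalize(text_emb)
    return np.argsort(-sims)[:k]

def train_sigmoid_probe(features, targets, lr=0.5, steps=500):
    # Fit a single linear layer with a sigmoid output using
    # binary cross-entropy and plain gradient descent.
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = probs - targets  # dBCE/dlogit
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b

# Synthetic stand-ins (assumption): 20 proposals near the text
# embedding and 80 background proposals pointing away from it.
rng = np.random.default_rng(0)
dim = 8
text_emb = np.zeros(dim)
text_emb[0] = 1.0
positives = text_emb + 0.1 * rng.normal(size=(20, dim))
negatives = rng.normal(size=(80, dim))
negatives[:, 0] = -np.abs(negatives[:, 0])
proposal_emb = np.vstack([positives, negatives])

# Pseudo labels: the top-k retrieved proposals are treated as positives.
top_idx = retrieve_top_k(proposal_emb, text_emb, k=10)
targets = np.zeros(len(proposal_emb))
targets[top_idx] = 1.0

w, b = train_sigmoid_probe(proposal_emb, targets)
probs = 1.0 / (1.0 + np.exp(-(proposal_emb @ w + b)))
```

In this toy setup the probe is trained only on retrieved pseudo labels, with no ground-truth boxes for the novel class, which mirrors the supervision described above in spirit.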

Paper Structure (11 sections, 5 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Comparison of box predictions for novel classes 'bus' and 'cake' between ViLD gu2021open (top) and our approach (bottom). In the ViLD results, low-quality boxes have similar scores to high-quality ones, leading to high false-positive (left) and false-negative (right) rates. Our approach significantly improves the detection performance in both cases by using classification scores instead of the similarity scores used in ViLD.
  • Figure 2: The feature embeddings of COCO novel classes are extracted from the penultimate layer of a Faster R-CNN pretrained on base classes. These embeddings are highly discriminative, which motivates us to learn a robust classifier from them.
  • Figure 3: Overview of our approach. LP-OVOD starts from the given ROI features extracted from Faster R-CNN ren2015faster with the same prior steps. In the pretraining step (left), a distillation head is added to mimic the prediction of CLIP's image encoder as in ViLD gu2021open. Furthermore, the softmax classifier is replaced with a sigmoid classifier and trained with the GT labels for the base classes. In the linear probing step (middle), a new sigmoid classifier with a learnable linear layer is trained on the pseudo labels of the novel classes. The pseudo labels are obtained by retrieving the top box proposals for the given novel text embedding. In the inference step (right), we simply concatenate the weights of the two sigmoid classifiers to form a unified sigmoid classifier for both base and novel classes, where the score of each class is predicted independently. Finally, the classification scores are combined with the distillation score to form the final score for detection.
  • Figure 4: Top-4 box proposal retrievals from CLIP's embeddings of novel classes, including 'elephant', 'dog', and 'knife'. The quality is good enough to be used as pseudo labels for training a few-shot classifier on novel classes.
  • Figure 5: Qualitative comparison of different approaches on COCO's novel classes. The first four columns show our superior performance, while the last one shows a failure case where none of the approaches can generate a box for the airplane due to its unusual aspect ratio.
  • ...and 2 more figures
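
The inference step described in the Figure 3 caption (concatenating the two sigmoid classifiers, then combining classification and distillation scores) can be sketched as follows. This is a hedged illustration only: the function names, array shapes, and random placeholder inputs are hypothetical, and the weighted geometric mean in `combine_scores` is an assumed combination rule, since the caption does not state how the two scores are combined.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unified_class_scores(roi_feats, w_base, b_base, w_novel, b_novel):
    # Stack base and novel classifier weights into one linear layer.
    # Sigmoid scores are per class, so adding novel rows leaves the
    # base-class scores unchanged.
    w = np.concatenate([w_base, w_novel], axis=0)  # (C_base + C_novel, D)
    b = np.concatenate([b_base, b_novel], axis=0)
    return sigmoid(roi_feats @ w.T + b)

def combine_scores(cls_scores, distill_scores, beta=0.5):
    # ASSUMPTION: weighted geometric mean of the two scores.
    return cls_scores ** (1.0 - beta) * distill_scores ** beta

# Placeholder inputs (assumption): 3 ROIs, 4-dim features,
# 2 base classes, 1 novel class.
rng = np.random.default_rng(0)
roi_feats = rng.normal(size=(3, 4))
w_base, b_base = rng.normal(size=(2, 4)), np.zeros(2)
w_novel, b_novel = rng.normal(size=(1, 4)), np.zeros(1)

cls_scores = unified_class_scores(roi_feats, w_base, b_base, w_novel, b_novel)
# Stand-in for distillation-head scores; real values would come from
# comparing region embeddings with text embeddings.
distill_scores = rng.uniform(0.05, 0.95, size=cls_scores.shape)
final_scores = combine_scores(cls_scores, distill_scores)
```

Because each class has its own sigmoid output, appending a novel-class row does not renormalize the existing scores, which is what makes the concatenation in the caption possible without retraining the base classifier.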