
SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector

Shuailei Ma, Yuefeng Wang, Ying Wei, Jiaqi Fan, Enming Zhang, Xinyu Sun, Peihao Chen

TL;DR

This work tackles open-world object detection (OWOD) by distilling open-world knowledge from large vision-language models into a language-agnostic detector. It introduces SKDF, a framework that uses a down-weight loss to mitigate forgetting of known classes and a cascade decoupled decoding architecture to separate localization from recognition. The authors also propose two benchmarks, StandardSet♥ and IntensiveSet♠, for evaluating unknown-object detection in open-world scenarios. Experiments on the OWOD and MS-COCO splits show that SKDF surpasses both its teacher and existing state-of-the-art methods on unknown-object detection while maintaining strong performance on known classes, with faster inference and a smaller model. These contributions offer a practical path to using open-world knowledge in real-world detectors and to broader evaluation of unknown-object discovery.

Abstract

In this paper, we attempt to specialize a vision-language model (VLM) for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector. Surprisingly, we observe that combining a simple knowledge distillation approach with the automatic pseudo-labeling mechanism in OWOD achieves better performance for unknown object detection, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely affects how detectors with conventional structures learn known objects, leading to catastrophic forgetting. To alleviate this problem, we propose the down-weight loss function for knowledge distillation from the vision-language modality to a single vision modality. We also propose the cascade decoupled decoding structure, which decouples the learning of localization and recognition to reduce the impact of category interactions between known and unknown objects on the localization learning process. Ablation experiments demonstrate that both are effective in mitigating the impact of open-world knowledge distillation on the learning of known objects. Additionally, to address the current lack of comprehensive benchmarks for evaluating how well an open-world detector detects unknown objects, we propose two benchmarks, named "StandardSet♥" and "IntensiveSet♠" according to the complexity of their testing scenarios. Comprehensive experiments on OWOD, MS-COCO, and our proposed benchmarks demonstrate the effectiveness of our methods. The code and proposed dataset are available at https://github.com/xiaomabufei/SKDF.
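The abstract names a down-weight loss but does not state its formula. As a rough illustration of the general idea only, the sketch below scales the loss of boxes that do not come from human annotation by a confidence score, so that distilled and pseudo-labeled boxes pull less strongly on the detector than annotated known objects. The `BoxLoss` type, the `down_weighted_total` function, the three source labels, and the choice of using the objectness score directly as the weight are all assumptions made for this example, not the authors' definition.

```python
from dataclasses import dataclass


@dataclass
class BoxLoss:
    loss: float        # per-box loss, assumed to be computed elsewhere
    source: str        # "known", "distilled", or "pseudo" (hypothetical labels)
    objectness: float  # confidence score for the box, expected in [0, 1]


def down_weighted_total(box_losses):
    """Sum per-box losses, down-weighting boxes that were not human-annotated.

    Assumption for illustration: annotated boxes keep weight 1.0, while
    distilled and pseudo-labeled boxes are weighted by their objectness.
    """
    total = 0.0
    for box in box_losses:
        if box.source == "known":
            weight = 1.0
        elif box.source in ("distilled", "pseudo"):
            # Clamp to [0, 1] so a noisy score can never up-weight a box.
            weight = max(0.0, min(1.0, box.objectness))
        else:
            raise ValueError(f"unknown label source: {box.source!r}")
        total += weight * box.loss
    return total


if __name__ == "__main__":
    batch = [
        BoxLoss(loss=2.0, source="known", objectness=0.1),
        BoxLoss(loss=2.0, source="distilled", objectness=0.5),
        BoxLoss(loss=4.0, source="pseudo", objectness=0.25),
    ]
    print(down_weighted_total(batch))
```

In this toy batch the annotated box contributes its full loss regardless of its score, while the other two contribute only a fraction of theirs. The released code at the repository linked above is the reference for the weighting the authors actually use.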

Paper Structure (34 sections, 10 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: SKDF uses the proposed down-weight training strategy to distill open-world knowledge from a large open-vocabulary pre-trained vision-language model into an expert open-world detector, which achieves faster detection and better performance while using only a small amount of data.
  • Figure 2: Overall scheme of the proposed framework. (a) illustrates the lifespan of the cascade open-world object detector: the model detects known objects and potential unknowns, human annotators progressively label some of the unknown classes, and the model incrementally updates its knowledge from these new labels without full retraining. (b) shows the down-weight training strategy, which leverages objectness to separate the learning weights of the annotated known knowledge, the distilled open-world knowledge, and the searched pseudo-open-world unknown knowledge. (c) describes the distillation procedure, which uses a large-scale vocabulary prompt to mine the open-world knowledge in the open-vocabulary vision-language pre-training model.
  • Figure 3: Overall architecture of the proposed cascade decoupled open-world detector. The detector consists of a multi-scale feature extractor, the decoupled cascade transformer decoder, and the regression prediction branch. The multi-scale feature extractor comprises a mainstream feature extraction backbone and a deformable transformer encoder. The decoupled cascade transformer decoders are deformable transformer decoders that decouple the localization and identification processes in a cascaded manner. The regression prediction branch contains the bounding box regression branch $F_{reg}$, the novelty objectness branch $F_{obj}$, and the novelty classification branch $F_{cls}$. The novelty classification and objectness branches are single-layer feed-forward networks (FFNs), and the regression branch is a 3-layer FFN.
  • Figure 4: Detailed data analysis of StandardSet♥ and IntensiveSet♠. In (a), we plot the area distribution of instances in the two benchmark test scenes, with the vertical axis showing the natural logarithm of the count. In (b) and (c), we analyze the aspect ratio and the spatial distribution of the instance bounding box annotations, respectively.
  • Figure 5: Qualitative results. Visualizations are based on the Task 1 setting. Our model detects unknown objects (yellow boxes) beyond the unknown labels obtained from GLIP with LVIS text prompts. The animation and game categories shown in the figures appear in neither the LVIS text prompts nor our training dataset, so our detector cannot have learned them from GLIP.
  • ...and 1 more figures
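
The three statistics named in the Figure 4 caption (instance area with a natural-log count axis, aspect ratio, and spatial distribution) can be computed from bounding box annotations alone. The sketch below is a minimal illustration, assuming boxes are given as (x, y, width, height) tuples in pixels in the COCO style. The function names, the fixed-width area bins, and the use of box centers for the spatial distribution are assumptions for this example and may differ from how the authors produced the figure.

```python
import math
from collections import Counter


def log_area_histogram(boxes, bin_size):
    """Bin box areas into fixed-width bins and return ln(count) per bin.

    boxes: iterable of (x, y, w, h); bin_size: bin width in squared pixels.
    Empty bins are omitted because ln(0) is undefined.
    """
    counts = Counter(int((w * h) // bin_size) for _, _, w, h in boxes)
    return {bin_index: math.log(n) for bin_index, n in sorted(counts.items())}


def aspect_ratios(boxes):
    """Width-to-height ratio of each box; boxes with zero height are skipped."""
    return [w / h for _, _, w, h in boxes if h > 0]


def box_centers(boxes):
    """Center point of each box, usable for a spatial distribution plot."""
    return [(x + w / 2, y + h / 2) for x, y, w, h in boxes]


if __name__ == "__main__":
    # Toy annotations, not taken from either benchmark.
    boxes = [(0, 0, 10, 10), (5, 5, 10, 5), (0, 0, 20, 10), (0, 0, 8, 12)]
    print(log_area_histogram(boxes, bin_size=100))
    print(aspect_ratios(boxes))
    print(box_centers(boxes))
```

Taking the logarithm of the counts keeps sparsely populated area bins visible next to heavily populated ones, which is presumably why the caption uses a log-scaled vertical axis.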