A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection

Shenghao Fu, Junkai Yan, Qize Yang, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng

TL;DR

Open-vocabulary object detectors struggle to generalize to novel classes when trained only on base-class annotations. The authors present HD-OVD, a hierarchical semantic distillation framework that transfers knowledge from CLIP at the instance, class, and image levels, using pseudo boxes and caption-based pseudo labels to cover unseen categories. On OV-COCO and OV-LVIS, HD-OVD achieves state-of-the-art novel-class AP ($AP_n$), reaching 46.4% on OV-COCO with a ResNet50 backbone, and generalizes well across datasets to COCO and Objects365, supporting the benefit of integrated, multi-level distillation. The approach offers a practical path to robust open-vocabulary recognition with flexible backbone choices and efficient inference.

Abstract

Open-vocabulary object detection (OVD) aims to detect objects beyond the training annotations. Detectors are usually aligned to a pre-trained vision-language model, e.g., CLIP, to inherit its generalizable recognition ability so that they can recognize novel objects. However, previous works directly align the feature space with CLIP and fail to learn the semantic knowledge effectively. In this work, we propose a hierarchical semantic distillation framework named HD-OVD to construct a comprehensive distillation process, which exploits generalizable knowledge from the CLIP model in three aspects. In the first hierarchy of HD-OVD, the detector learns fine-grained instance-wise semantics from the CLIP image encoder by modeling relations among single objects in the visual space. In the second hierarchy, we introduce a text-space novel-class-aware classification to help the detector assimilate the highly generalizable class-wise semantics from the CLIP text encoder. Lastly, abundant image-wise semantics covering multiple objects and their contexts are distilled through an image-wise contrastive distillation. Benefiting from this elaborate semantic distillation across three hierarchies, HD-OVD inherits generalizable recognition ability from CLIP at the instance, class, and image levels. As a result, we boost the novel AP on the OV-COCO dataset to 46.4% with a ResNet50 backbone, which outperforms other methods by a clear margin. We also conduct extensive ablation studies to analyze how each component works.
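
To make the first hierarchy more concrete, the sketch below shows one possible way to transfer "relations among single objects" from a teacher to a student. It is an illustration under stated assumptions, not the loss used in the paper: the abstract does not give the formula, so the choice of cosine similarity, a row-wise softmax with a temperature, and a KL divergence is assumed here, and the names `relation_matrix` and `relation_distillation_loss` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def relation_matrix(features, temperature=0.1):
    """For each instance, a softmax distribution over its cosine
    similarities to every OTHER instance (self-similarity is excluded)."""
    if len(features) < 2:
        raise ValueError("need at least two instances to model relations")
    feats = [l2_normalize(f) for f in features]
    rows = []
    for i, fi in enumerate(feats):
        logits = [
            sum(a * b for a, b in zip(fi, fj)) / temperature
            for j, fj in enumerate(feats)
            if j != i
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def relation_distillation_loss(teacher_feats, student_feats, temperature=0.1):
    """Mean KL(teacher || student) between the two relation matrices.

    Assumption: teacher_feats stand in for CLIP region features and
    student_feats for detector RoI features of the same instances.
    """
    teacher = relation_matrix(teacher_feats, temperature)
    student = relation_matrix(student_feats, temperature)
    eps = 1e-12
    total = 0.0
    for t_row, s_row in zip(teacher, student):
        total += sum(
            p * math.log((p + eps) / (q + eps)) for p, q in zip(t_row, s_row)
        )
    return total / len(teacher)


# Toy features for three instances (made-up numbers, not real CLIP outputs).
teacher_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
aligned_student = [[2.0, 0.0, 0.0], [1.8, 0.2, 0.0], [0.0, 3.0, 0.0]]
shuffled_student = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

loss_aligned = relation_distillation_loss(teacher_feats, aligned_student)
loss_shuffled = relation_distillation_loss(teacher_feats, shuffled_student)
```

Because only pairwise relations are compared, the student features in this sketch may have a different dimensionality from the teacher features, as in the `aligned_student` example.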

Paper Structure

This paper contains 19 sections, 7 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: The illustration of three-level semantic knowledge modeling in HD-OVD. It consists of (a) instance-wise relation modeling, (b) class-wise novel-class-aware classification, and (c) image-wise contrastive distillation. Distilling hierarchical semantics from CLIP equips HD-OVD with strong open-vocabulary ability. Base and novel class boxes are colored in yellow and blue. Best viewed in color.
  • Figure 2: t-SNE plots of RoI features for each model. Different colors represent different classes. The model trained with base classes struggles to separate similar base and novel classes, as shown in red circles. Our HD-OVD can separate them clearly.
  • Figure 3: The overview of HD-OVD, which distills semantics from CLIP to a query-based detector hierarchically. At the image level, the backbone feature of the image is aligned with the CLIP global image feature. At the instance level, relations among various single instances from the CLIP image encoder are transferred to the detector. The class-level distillation conducts a novel-class-aware classification to enable the detector to inherit the high-level semantics from the CLIP text encoder. To save training computation cost, all CLIP features are pre-extracted in an offline manner. Best viewed in color.
  • Figure 4: The pipeline for generating pseudo text labels. We first generate captions for each region based on CLIP region features. Then, nouns are extracted from the captions by a grammar parser. The noun with the highest CLIP similarity to the region features is selected as the final pseudo text label.
  • Figure 5: Visualization of generated captions and pseudo labels. Pseudo boxes are drawn in blue. Correct and incorrect pseudo labels are colored in green and red, respectively. The powerful BLIP-2 model can generate more accurate pseudo text labels than ClipCap.
  • ...and 1 more figures
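
The final step of the pseudo-label pipeline described in the Figure 4 caption, picking the noun whose embedding is most similar to the region feature, can be sketched as follows. This is a minimal illustration, not the authors' code: captioning and noun extraction are assumed to have already happened, the embeddings are made-up toy vectors rather than real CLIP outputs, and the helper names `cosine_similarity` and `select_pseudo_label` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_pseudo_label(region_feature, noun_features):
    """Return the candidate noun most similar to the region feature.

    noun_features maps each noun extracted from the region's caption to
    its text embedding. Returns None when no nouns were extracted.
    """
    if not noun_features:
        return None
    return max(
        noun_features,
        key=lambda noun: cosine_similarity(region_feature, noun_features[noun]),
    )


# Toy example: a region feature and three candidate nouns from its caption.
region = [0.9, 0.1, 0.0]
nouns = {
    "dog": [1.0, 0.0, 0.0],
    "grass": [0.0, 1.0, 0.0],
    "frisbee": [0.0, 0.0, 1.0],
}
label = select_pseudo_label(region, nouns)
```

In a real pipeline the region feature would come from the CLIP image encoder and the noun embeddings from the CLIP text encoder, so that both live in the same embedding space.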