Learning to Detect and Segment for Open Vocabulary Object Detection

Tao Wang, Nan Li

TL;DR

Open vocabulary object detection seeks to recognize objects outside the training categories by leveraging vision-language embeddings. CondHead conditions the box regression and mask segmentation heads on semantic embeddings, combining a dynamically aggregated set of static expert heads with dynamically generated parameters to enable class-specific yet generalizable predictions. Trained on base categories, CondHead shows consistent improvements over state-of-the-art open-vocabulary detectors on COCO and LVIS, including in cross-dataset transfer, with only modest overhead. These results underscore the value of semantically conditioned dynamic heads for bridging base and novel categories, and suggest further gains from refining semantic prompts.

Abstract

Open vocabulary object detection has been greatly advanced by the recent development of vision-language pretrained models, which help recognize novel objects given only their semantic categories. Prior works mainly focus on transferring knowledge to object proposal classification and employ class-agnostic box and mask prediction. In this work, we propose CondHead, a principled dynamic network design to better generalize box regression and mask segmentation to the open vocabulary setting. The core idea is to conditionally parameterize the network heads on the semantic embedding, so that the model is guided by class-specific knowledge to better detect novel categories. Specifically, CondHead is composed of two streams of network heads: the dynamically aggregated head and the dynamically generated head. The former is instantiated with a set of static heads that are conditionally aggregated; these heads are optimized as experts and are expected to learn sophisticated prediction. The latter is instantiated with dynamically generated parameters and encodes general class-specific information. With such a conditional design, the detection model is bridged by the semantic embedding to offer strongly generalizable class-wise box and mask prediction. Our method brings significant improvement to state-of-the-art open vocabulary object detection methods with very minor overhead; e.g., it surpasses a RegionCLIP model by 3.0 detection AP on novel categories with only 1.1% more computation.
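
The two-stream idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear heads, the softmax gating, the additive combination, and every name here (`cond_box_head`, `w_gate`, `w_gen`, `mix`) are assumptions made purely for illustration. The sketch shows a box-regression head whose weights depend on a class's semantic embedding in two ways: a gated average of static expert heads, and a weight matrix generated directly from the embedding.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cond_box_head(feat, sem, experts, w_gate, w_gen, mix=0.5):
    """Hypothetical semantically conditioned box head (illustrative only).

    feat:    (D,)      region feature of one proposal
    sem:     (E,)      semantic (text) embedding of the predicted class
    experts: (K, 4, D) K static linear heads, each predicting 4 box deltas
    w_gate:  (K, E)    maps the embedding to expert gating logits
    w_gen:   (4*D, E)  maps the embedding to a generated (4, D) weight matrix
    mix:     scalar weight on the generated stream (an assumed hyperparameter)
    """
    # Stream 1: dynamically aggregated head (gated average of static experts).
    gate = softmax(w_gate @ sem)                    # (K,)
    w_agg = np.tensordot(gate, experts, axes=1)     # (4, D)
    # Stream 2: dynamically generated head (weights produced from the embedding).
    w_dyn = (w_gen @ sem).reshape(4, feat.shape[0])  # (4, D)
    # Combine both streams into class-conditioned box deltas.
    return w_agg @ feat + mix * (w_dyn @ feat)

# Usage with random placeholder tensors (shapes only, no trained weights).
rng = np.random.default_rng(0)
D, E, K = 16, 8, 4
deltas = cond_box_head(
    rng.normal(size=D), rng.normal(size=E),
    rng.normal(size=(K, 4, D)), rng.normal(size=(K, E)),
    rng.normal(size=(4 * D, E)) * 0.01,
)
print(deltas.shape)
```

Because both streams are driven only by the semantic embedding, a novel category whose embedding lies near a base category receives similar head parameters, which is the generalization mechanism the abstract appeals to.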

Paper Structure

This paper contains 11 sections, 12 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Illustration of our main intuition. Given the object proposals, the bounding box regression and mask segmentation learned from some object categories could generalize to the target category. For example, the knowledge learned from a chicken could help detect and segment the long thin feet and the small head of an ibis (upper row). Similarly for the hairbrush, the knowledge learned from the toothbrush could better handle the extreme aspect ratio and occlusion from the hand (lower row).
  • Figure 2: Overview of CondHead. To detect objects of novel categories, we aim at conditionally parameterizing the bounding box regression and mask segmentation based on the semantic embedding, which is strongly correlated with the visual feature and provides effective class-specific cues to refine the box and predict the mask.
  • Figure 3: Qualitative comparison with the baseline ViLD (Gu et al., 2021). The bounding box regression and mask segmentation results are overlaid on the images (Yellow: Proposals. Green: Regressed bounding box. Blue: Segmentation mask). Best viewed with zoom-in.
  • Figure 4: Effect of tuning language descriptions. We select some intriguing examples for which tuning the input language descriptions could deteriorate or help object detection. Yellow: Proposals. Green: Regressed bounding box.
  • Figure 5: Component analysis. Effect of expert number, $\lambda$ and $\mu$.
  • ...and 3 more figures