Table of Contents
Fetching ...

Open-Vocabulary Object Detection via Language Hierarchy

Jiaxing Huang, Jingyi Zhang, Kai Jiang, Shijian Lu

TL;DR

Language Hierarchical Self-training (LHST) is designed that introduces language hierarchy into weakly-supervised detector training for learning more generalizable detectors and co-regularization between the expanded labels and self-training.

Abstract

Recent studies on generalizable object detection have attracted increasing attention with additional weak supervision from large-scale datasets with image-level labels. However, weakly-supervised detection learning often suffers from image-to-box label mismatch, i.e., image-level labels do not convey precise object information. We design Language Hierarchical Self-training (LHST) that introduces language hierarchy into weakly-supervised detector training for learning more generalizable detectors. LHST expands the image-level labels with language hierarchy and enables co-regularization between the expanded labels and self-training. Specifically, the expanded labels regularize self-training by providing richer supervision and mitigating the image-to-box label mismatch, while self-training allows assessing and selecting the expanded labels according to the predicted reliability. In addition, we design language hierarchical prompt generation that introduces language hierarchy into prompt generation which helps bridge the vocabulary gaps between training and testing. Extensive experiments show that the proposed techniques achieve superior generalization performance consistently across 14 widely studied object detection datasets.

Open-Vocabulary Object Detection via Language Hierarchy

TL;DR

Language Hierarchical Self-training (LHST) is designed that introduces language hierarchy into weakly-supervised detector training for learning more generalizable detectors and co-regularization between the expanded labels and self-training.

Abstract

Recent studies on generalizable object detection have attracted increasing attention with additional weak supervision from large-scale datasets with image-level labels. However, weakly-supervised detection learning often suffers from image-to-box label mismatch, i.e., image-level labels do not convey precise object information. We design Language Hierarchical Self-training (LHST) that introduces language hierarchy into weakly-supervised detector training for learning more generalizable detectors. LHST expands the image-level labels with language hierarchy and enables co-regularization between the expanded labels and self-training. Specifically, the expanded labels regularize self-training by providing richer supervision and mitigating the image-to-box label mismatch, while self-training allows assessing and selecting the expanded labels according to the predicted reliability. In addition, we design language hierarchical prompt generation that introduces language hierarchy into prompt generation which helps bridge the vocabulary gaps between training and testing. Extensive experiments show that the proposed techniques achieve superior generalization performance consistently across 14 widely studied object detection datasets.

Paper Structure

This paper contains 10 sections, 9 equations, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Image-level labels in large-scale datasets such as ImageNet-21k deng2009imagenet often do not convey precise object information rasheed2022bridgingzhou2022detecting which affects while learning generalizable detectors. Recent methods tackle this issue by various label-to-box assignment strategies bilen2016weaklyredmon2017yolo9000ramanathan2020dlwlzhou2022detecting as in (a) but are heavily restricted by raw image-level labels and still suffer from image-to-box label mismatch rasheed2022bridging. Self-training sohn2020simple with the detectors pre-trained with redmon2017yolo9000ramanathan2020dlwlzhou2022detecting could circumvent the label mismatch issue but the generated pseudo box labels are error-prone due to the lack of proper supervision as in (b). Our proposed LHST introduces language hierarchy to expand the image-level labels and enables co-regularization between the expanded labels and self-training which allows producing more accurate pseudo box labels in (c).
  • Figure 2: The proposed language hierarchical self-training consists of two flows including Pseudo Label Generation (top box) and Training with Generated Labels (bottom box). The Pseudo Label Generation flow leverages WordNet to expand the image-level labels, and then merges the expanded image-level labels with the predicted pseudo box labels, such that the expanded image-level labels could provide richer and more flexible supervision (than the limited and rigid raw labels) to regularize the self-training which is prone to errors in pseudo labeling. In addition, as the labels expanded by WordNet (i.e., the expanded logits '1' in $y_{image}^{hier}$ and $y_{box}^{hier}$) are not all reliable, Pseudo Label Generation predicts reliability scores for the expanded labels to adaptively re-weight them when applying them on different images or pseudo boxes. In Training with Generated Labels, we optimize the detector with the generated image-level and box-level labels, where the image-level training could regularize the training with pseudo box-level labels as pseudo box labels vary along training iterations and are not very stable.