SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection

Mingxuan Liu, Tyler L. Hayes, Elisa Ricci, Gabriela Csurka, Riccardo Volpi

TL;DR

The paper addresses the sensitivity of open-vocabulary object detectors to vocabulary granularity. It introduces SHiNe (Semantic Hierarchy Nexus), a training-free method that constructs a hierarchy-aware classifier offline: for each class, it retrieves super-/sub-categories, composes them into Is-A sentences, and fuses the sentence embeddings into a single nexus vector. Detection scores are then computed as $s_m(c,\mathbf{z}_m)=\langle \mathbf{n}_c, \mathbf{z}_m\rangle$ with $O(c)$ complexity. SHiNe improves robustness across diverse granularities on iNatLoc and FSOD (up to +31.9 mAP50) and extends to open-vocabulary classification on ImageNet-1k (up to +2.8% accuracy), while keeping inference speed comparable to the baseline detectors. The approach works with either ground-truth or LLM-generated hierarchies, making it broadly applicable to real-world, safety-critical scenarios, and its code is open source for easy integration with existing OvOD systems.
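
With a mean aggregator (the example aggregator named in the Figure 2 caption), the nexus vector and the score can be written out as follows. This is a sketch only: the symbols $K_c$ (number of Is-A sentences built for class $c$), $\mathbf{e}_{c,k}$ (text embedding of the $k$-th sentence), and the final L2 normalization are notation assumed here for illustration, not taken from the paper.

$$
\bar{\mathbf{e}}_c = \frac{1}{K_c} \sum_{k=1}^{K_c} \mathbf{e}_{c,k},
\qquad
\mathbf{n}_c = \frac{\bar{\mathbf{e}}_c}{\lVert \bar{\mathbf{e}}_c \rVert},
\qquad
s_m(c,\mathbf{z}_m) = \langle \mathbf{n}_c, \mathbf{z}_m \rangle
$$

Under this reading, each class keeps a single vector $\mathbf{n}_c$ no matter how many sentences were fused, so scoring against $c$ classes takes $c$ inner products, the same as a classifier built from plain class names.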

Abstract

Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task, empowering users to freely define their class vocabularies of interest during inference. However, our initial investigation indicates that existing OvOD detectors exhibit significant variability when dealing with vocabularies across various semantic granularities, posing a concern for real-world deployment. To this end, we introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies. It runs offline in three steps: i) it retrieves relevant super-/sub-categories from a hierarchy for each target class; ii) it integrates these categories into hierarchy-aware sentences; iii) it fuses these sentence embeddings to generate the nexus classifier vector. Our evaluation on various detection benchmarks demonstrates that SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies, while retaining improvements using hierarchies generated by large language models. Moreover, when applied to open-vocabulary classification on ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector, without incurring additional computational overhead during inference. The code is open source.
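
The three offline steps and the inference-time scoring can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy hierarchy, the exact sentence template, the choice of one sentence per sub-category, and every function name are hypothetical, and `toy_text_encoder` is a hash-based stand-in for a frozen VLM text encoder such as CLIP.

```python
import hashlib
import math
import random

# Hypothetical toy hierarchy stored as child -> parent links.
PARENT = {
    "vampire bat": "bat",
    "fruit bat": "bat",
    "bat": "mammal",
    "mammal": "vertebrate",
}

def super_categories(name):
    """Step i (upward): follow parent links from `name` up to the root."""
    chain = []
    while name in PARENT:
        name = PARENT[name]
        chain.append(name)
    return chain

def sub_categories(name):
    """Step i (downward): direct children of `name`, in sorted order."""
    return sorted(child for child, parent in PARENT.items() if parent == name)

def is_a_sentences(name):
    """Step ii: one Is-A sentence per sub-category (assumed template)."""
    supers = super_categories(name)
    starts = [[sub, name] for sub in sub_categories(name)] or [[name]]
    return [", which is a ".join(start + supers) for start in starts]

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def toy_text_encoder(sentence, dim=32):
    """Stand-in encoder: a deterministic pseudo-random unit vector per string."""
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])

def nexus_vector(name, encode=toy_text_encoder):
    """Step iii: mean-aggregate the sentence embeddings, then L2-normalize."""
    embeddings = [encode(s) for s in is_a_sentences(name)]
    mean = [sum(column) / len(embeddings) for column in zip(*embeddings)]
    return normalize(mean)

def build_classifier(vocabulary, encode=toy_text_encoder):
    """Offline: one nexus vector per class in the user's vocabulary."""
    return {c: nexus_vector(c, encode) for c in vocabulary}

def score(classifier, region_embedding):
    """Inference: one inner product per class against a region embedding."""
    return {
        c: sum(n * z for n, z in zip(vec, region_embedding))
        for c, vec in classifier.items()
    }

if __name__ == "__main__":
    classifier = build_classifier(["bat", "mammal"])
    for sentence in is_a_sentences("bat"):
        print(sentence)
    # Use the "bat" nexus vector itself as a fake region embedding.
    print(score(classifier, classifier["bat"]))
```

Because the classifier is built once offline and holds a single vector per class, the detector's inference path is unchanged: it only swaps in a different set of class vectors.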

Paper Structure

This paper contains 26 sections, 5 equations, 9 figures, 12 tables, 1 algorithm.

Figures (9)

  • Figure 1: (Top) Classifier comparison for open-vocabulary object detectors: (Left) standard methods use solely class names in the vocabulary specified by the user to extract text embeddings; (Right) our proposed SHiNe fuses information from super-/sub-categories into nexus points to generate hierarchy-aware representations. (Bottom) Open-vocabulary detection performance at different levels of vocabulary granularity specified by users: a standard Baseline underperforms and presents significant variability; SHiNe allows for improved and more uniform performance across various vocabularies. Results are on the iNatLoc dataset (Cole et al., 2022).
  • Figure 2: Overview of our method. (Top) SHiNe constructs the semantic hierarchy nexus classifier in three steps offline: (1) For each target class (e.g., "Bat" in green) in the given vocabulary, we query the associated super- (in blue) and sub- (in pink) categories from a semantic hierarchy. (2) These retrieved categories along with their interrelationships are integrated into a set of hierarchy-aware sentences using our proposed Is-A connector. (3) These sentences are then encoded by a frozen VLM text encoder (e.g., CLIP; Radford et al., 2021) and subsequently fused using an aggregator (e.g., mean-aggregator) to form a nexus classifier vector for the target class. (Bottom) The constructed classifier is directly applied to an off-the-shelf OvOD detector for inference, enhancing its robustness across various levels of vocabulary granularity.
  • Figure 3: Study of hierarchy-aware sentence integration methods (left) and aggregators (right) across various label granularity levels on the iNatLoc dataset. Detic with a Swin-B backbone is used as the baseline. Darker background color indicates higher mAP50. The default components of SHiNe are underlined. Note that the experiment in (a) omits sub-categories and the aggregation step.
  • Figure 4: Analysis of OvOD detection performance under noisy, mis-specified label vocabularies on iNatLoc (left) and FSOD (right) datasets. We assess the detection performance of both the baseline detector (in grey) and our method (in green) under varied supervision signals, contrasting results between the original ($\diamond$) and the expanded mis-specified ($\bigcirc$) vocabularies. SHiNe employs the LLM-generated hierarchy for both vocabularies. We report mAP50, highlighting the performance drop ($\Delta$).
  • Figure 5: Examples of integrating hierarchy-aware sentences with different hierarchy structures. We use "Bat" as an example of the target Class of Interest (CoI). The retrieved super-categories, the sub-categories, and the target CoI are color-coded in blue, red, and green, respectively. (a) The target CoI is linked to a unique super-category at each higher hierarchy level and multiple sub-categories at each lower level, akin to the ground-truth hierarchy structure of the datasets. (b) The target CoI is associated with multiple super-categories at the upper hierarchy level and multiple sub-categories at the lower level, akin to the simple three-level LLM-generated hierarchy structures.
  • ...and 4 more figures
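
The three-level structure in Figure 5 (b), where a target class has several super-categories and several sub-categories, can be illustrated with a short sketch. It assumes one sentence per (sub-category, super-category) pair and a "which is a" template; both are illustrative guesses rather than the paper's confirmed procedure, and the function name and example categories are hypothetical.

```python
from itertools import product

def is_a_sentences_multi(target, supers, subs):
    """Enumerate Is-A sentences for a target with several parents and children.

    Assumption: every (sub-category, super-category) pair yields one sentence.
    A missing level is skipped rather than filled in.
    """
    lowers = subs or [None]
    uppers = supers or [None]
    sentences = []
    for sub, sup in product(lowers, uppers):
        chain = [name for name in (sub, target, sup) if name is not None]
        sentences.append(", which is a ".join(chain))
    return sentences

if __name__ == "__main__":
    for sentence in is_a_sentences_multi(
        "bat",
        supers=["mammal", "flying vertebrate"],
        subs=["fruit bat", "vampire bat"],
    ):
        print(sentence)
```

With two super-categories and two sub-categories this yields four sentences for "bat". Under the same assumption, the single-parent structure in (a) is the special case where each level contributes exactly one super-category.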