SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection
Mingxuan Liu, Tyler L. Hayes, Elisa Ricci, Gabriela Csurka, Riccardo Volpi
TL;DR
Open-vocabulary object detectors are sensitive to the granularity of the vocabulary they are given. The paper introduces SHiNe (Semantic Hierarchy Nexus), a training-free, hierarchy-aware classifier that is built offline: for each target class it retrieves super-/sub-categories, encodes them into Is-A sentences, and fuses the sentence embeddings into a single nexus vector $\mathbf{n}_c$. At inference, a region embedding $\mathbf{z}_m$ is scored against class $c$ as $s_m(c,\mathbf{z}_m)=\langle \mathbf{n}_c, \mathbf{z}_m\rangle$, so the cost stays at $O(c)$, that is, one inner product per class. SHiNe improves robustness across vocabulary granularities on iNatLoc and FSOD (up to +31.9% mAP50) and extends to open-vocabulary classification on ImageNet-1k (up to +2.8% accuracy), while keeping inference speed comparable to the baseline detectors. It works with either ground-truth or LLM-generated hierarchies, which makes it applicable to real-world settings, including safety-critical ones. The code is open source and integrates with existing OvOD systems.
Abstract
Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task, empowering users to freely define their class vocabularies of interest during inference. However, our initial investigation indicates that the performance of existing OvOD detectors varies significantly across vocabularies of different semantic granularities, posing a concern for real-world deployment. To this end, we introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies. It runs offline in three steps: i) it retrieves relevant super-/sub-categories from a hierarchy for each target class; ii) it integrates these categories into hierarchy-aware sentences; iii) it fuses these sentence embeddings to generate the nexus classifier vector. Our evaluation on various detection benchmarks demonstrates that SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground-truth hierarchies, while retaining improvements when using hierarchies generated by large language models. Moreover, when applied to open-vocabulary classification on ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector, without incurring additional computational overhead during inference. The code is open source.
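
The three offline steps and the inner-product scoring can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation. The toy hierarchy `PARENT`, the sentence template, the mean-then-normalize fusion, and the hash-based `toy_encode` (a stand-in for a real text encoder such as CLIP's) are all hypothetical choices made for this example.

```python
import hashlib
import numpy as np

# Hypothetical toy hierarchy (child -> parent); not taken from the paper.
PARENT = {
    "labrador": "dog",
    "poodle": "dog",
    "dog": "mammal",
    "tabby": "cat",
    "cat": "mammal",
}

def supers(name):
    """Step i (a): all ancestors of `name`, nearest first."""
    out = []
    while name in PARENT:
        name = PARENT[name]
        out.append(name)
    return out

def subs(name):
    """Step i (b): direct children of `name`, sorted for determinism."""
    return sorted(child for child, parent in PARENT.items() if parent == name)

def is_a_sentences(name):
    """Step ii: one Is-A sentence per (sub, super) pair.

    The template "a X, which is a Y, which is a Z" is an assumed format.
    """
    sup = supers(name) or [None]
    sub = subs(name) or [None]
    sentences = []
    for lo in sub:
        for hi in sup:
            chain = [x for x in (lo, name, hi) if x is not None]
            sentences.append("a " + ", which is a ".join(chain))
    return sentences

def toy_encode(text, dim=64):
    """Stand-in text encoder: a deterministic pseudo-random unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def nexus_vector(name, encode=toy_encode):
    """Step iii: fuse sentence embeddings (here: mean, then L2-normalize)."""
    embs = np.stack([encode(s) for s in is_a_sentences(name)])
    n = embs.mean(axis=0)
    return n / np.linalg.norm(n)

def build_classifier(vocab, encode=toy_encode):
    """Offline: stack one nexus vector per class into a (num_classes, dim) matrix."""
    return np.stack([nexus_vector(c, encode) for c in vocab])

def classify(region_embedding, classifier):
    """Inference: s(c, z) = <n_c, z>, i.e. one inner product per class."""
    scores = classifier @ region_embedding
    return int(np.argmax(scores)), scores

vocab = ["dog", "cat", "mammal"]
W = build_classifier(vocab)        # built once, before inference
z = W[1]                           # pretend a region embedding matches "cat"
best, scores = classify(z, W)
print(vocab[best], is_a_sentences("dog"))
```

Because all hierarchy handling happens inside `build_classifier`, the inference step in this sketch is the same single matrix-vector product a plain text-embedding classifier would use, which is consistent with the claim that no extra inference overhead is incurred.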
