TARO: Toward Semantically Rich Open-World Object Detection

Yuchen Zhang, Yao Lu, Johannes Betz

TL;DR

TARO tackles open-world object detection by moving beyond labeling all unknowns as a single class and instead assigning them to coarse, semantically meaningful parent categories within a taxonomy. It extends DETR-based detectors with three key components: a sparsemax-based objectness head that induces sparse competition among queries, a hierarchy-aware activation that couples parent and child predictions, and a hierarchy-guided relabeling strategy that uses non-leaf activations to provide auxiliary supervision for objectness. Empirical results on the OWOD and OW-DETR splits show that TARO achieves higher unknown recall, reduces confusion between known and unknown objects, and categorizes unknowns with up to 29.9% Hierarchy Accuracy on the OWOD split, while maintaining competitive known-class mAP. The work also presents ablations and discusses future directions, including leveraging Vision-Language Models and multimodal data to further enhance semantic understanding of unknowns in open spaces.
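To make the "sparse competition among queries" concrete, here is a minimal pure-Python sketch of the standard sparsemax projection applied to one objectness score per query. The function name and the toy scores are illustrative assumptions; how TARO wires sparsemax into its objectness head and loss is not specified in this summary.

```python
def sparsemax(scores):
    """Project raw scores onto the probability simplex (standard sparsemax).

    Unlike softmax, low-scoring entries receive exactly zero mass.
    """
    ranked = sorted(scores, reverse=True)
    running_sum = 0.0
    support_size = 0
    support_sum = 0.0
    for k, z_k in enumerate(ranked, start=1):
        running_sum += z_k
        # Entry k belongs to the support while 1 + k * z_k exceeds the prefix sum.
        if 1.0 + k * z_k > running_sum:
            support_size = k
            support_sum = running_sum
    # Threshold that makes the surviving entries sum to one.
    tau = (support_sum - 1.0) / support_size
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical objectness scores for three queries in one image.
query_scores = [0.5, 0.4, -1.0]
objectness = sparsemax(query_scores)
```

In this toy case the third query is pushed to exactly zero while the first two share the mass, which is the kind of sparsity a softmax over the same scores would not produce.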

Abstract

Modern object detectors are largely confined to a "closed-world" assumption, limiting them to a predefined set of classes and posing risks when encountering novel objects in real-world scenarios. While open-set detection methods aim to address this by identifying such instances as 'Unknown', this is often insufficient. Rather than treating all unknowns as a single class, assigning them more descriptive subcategories can enhance decision-making in safety-critical contexts. For example, identifying an object as an 'Unknown Animal' (requiring an urgent stop) versus 'Unknown Debris' (requiring a safe lane change) is far more useful than just 'Unknown' in autonomous driving. To bridge this gap, we introduce TARO, a novel detection framework that not only identifies unknown objects but also classifies them into coarse parent categories within a semantic hierarchy. TARO employs a unique architecture with a sparsemax-based head for modeling objectness, a hierarchy-guided relabeling component that provides auxiliary supervision, and a classification module that learns hierarchical relationships. Experiments show TARO can categorize up to 29.9% of unknowns into meaningful coarse classes, significantly reduce confusion between unknown and known classes, and achieve competitive performance in both unknown recall and known mAP. Code will be made available.
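As a rough illustration of classifying an unknown into a coarse parent category, the sketch below couples parent and child activations by multiplying each child's sigmoid score by its parent's activation, then picks the strongest non-leaf node. This coupling rule, the toy taxonomy, and every name in the code are assumptions made for illustration; the abstract does not state TARO's actual activation.

```python
import math

# Hypothetical toy taxonomy: node -> parent (None marks a root).
PARENT = {
    "animal": None,
    "vehicle": None,
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def coupled_activations(logits):
    """Return node -> activation, with each child scaled by its parent.

    Assumed coupling: act(child) = sigmoid(logit) * act(parent), so a child
    can never be more active than its parent.
    """
    cache = {}

    def activation(node):
        if node not in cache:
            own = sigmoid(logits[node])
            parent = PARENT[node]
            cache[node] = own if parent is None else own * activation(parent)
        return cache[node]

    return {node: activation(node) for node in logits}


def coarse_label(logits):
    """Pick the non-leaf node with the strongest coupled activation."""
    acts = coupled_activations(logits)
    non_leaf = {p for p in PARENT.values() if p is not None}
    return max(non_leaf, key=lambda node: acts[node])


# Hypothetical logits for one detection that fits no leaf class well.
logits = {"animal": 2.0, "vehicle": -2.0, "dog": 0.0, "cat": -1.0, "car": 0.0}
label = coarse_label(logits)  # "animal"
```

A detection like this one could then be reported as an unknown under the "animal" parent rather than as a bare "Unknown".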

Paper Structure

This paper contains 36 sections, 4 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Taxonomy tree serving as an example of the hierarchical structure.
  • Figure 2: Overall pipeline of the proposed method. Building on D-DETR, the classification head first applies a hierarchy-aware activation that couples parent and child classes. Based on these activations, we apply a hierarchy-guided relabeling strategy. In standard D-DETR, only queries that are matched to ground-truth objects by the Hungarian matcher are labeled as positives, with all others treated as background. In contrast, our approach relabels queries with strong non-leaf activations as potential objects (e.g., Q2), while queries with weak activations remain background (e.g., Q3). The target of the objectness head is updated accordingly, providing auxiliary supervision that complements the primary objectness modeling and helps refine objectness learning. The objectness head uses sparsemax to model competition and sparsity among queries, guided by the updated labels from the classification head. The regression head is identical to that of the original D-DETR. For clarity, the Hungarian matcher is omitted, and matched queries are indicated by a crown symbol.
  • Figure 3: Qualitative results from TARO (bottom row) compared with OW-DETR (top row) and PROB (middle row). Predicted known objects are shown in blue, while predicted unknown objects are shown in orange. The first two columns illustrate TARO’s capability to detect unknown objects: TARO not only localizes them accurately but also assigns meaningful coarse categories (e.g., the excavator in the first image and the spatula in the second image). The third column highlights TARO’s stable performance in detecting known objects. For a fair comparison, the same number of top-k predictions is shown for each image. More qualitative results are available in the appendix.
  • Figure 4: More qualitative results of TARO.
  • Figure 5: Taxonomy of object categories used in TARO.
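
The relabeling step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the label strings, and the threshold value of 0.5 are hypothetical, and the caption does not say how "strong" is defined or how the relabeled queries translate into numeric objectness targets.

```python
def relabel_queries(matched, non_leaf_activation, threshold=0.5):
    """Assign a label to each query following the caption's three cases.

    matched: per-query booleans from the Hungarian matcher.
    non_leaf_activation: per-query strongest non-leaf (parent) activation.
    threshold: hypothetical cutoff separating strong from weak activations.
    """
    labels = []
    for is_matched, activation in zip(matched, non_leaf_activation):
        if is_matched:
            # Matched to a ground-truth object: positive, as in standard D-DETR.
            labels.append("object")
        elif activation >= threshold:
            # Unmatched but strongly activated on a non-leaf node (like Q2).
            labels.append("potential_object")
        else:
            # Unmatched and weakly activated (like Q3).
            labels.append("background")
    return labels


# Three hypothetical queries: one matched, one like Q2, one like Q3.
labels = relabel_queries(
    matched=[True, False, False],
    non_leaf_activation=[0.9, 0.8, 0.1],
)
# labels == ["object", "potential_object", "background"]
```

In standard D-DETR the second query would simply be background; here it is kept as a candidate so that the objectness head receives a supervisory signal for it.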