CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection

Zhichao Sun, Huazhang Hu, Yidong Ma, Gang Liu, Yibo Chen, Xu Tang, Yao Hu, Yongchao Xu

TL;DR

This work tackles the difficulty of vast vocabulary object detection by identifying gradient-dilution issues that plague classification-based detectors as category counts explode. It introduces CQ-DINO, a category-query–based detector that uses learnable category queries and image-guided query selection to sparsify the search space, balance gradients, and implicitly mine hard negatives. Category correlations are encoded via explicit hierarchical trees for structured data or self-attention for unstructured data, enabling effective reasoning over thousands of categories. Empirically, CQ-DINO achieves state-of-the-art results on V3Det (over 2 AP improvement) while remaining competitive on COCO, and demonstrates strong scalability up to very large vocabularies with efficient memory use. The public code enhances reproducibility and paves the way for practical wide-vocabulary detection systems.
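
To make the gradient-dilution argument concrete, the following minimal sketch computes a positive-to-negative gradient ratio for a single object query. It assumes a sigmoid binary cross-entropy loss (whose gradient with respect to each logit is p - y), exactly one positive category, and identical logits for all negatives. The function names and the value K = 100 are hypothetical and chosen only for illustration. The vocabulary sizes 80 and 13,204 correspond to the category counts of COCO and V3Det.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def pos_neg_gradient_ratio(num_categories, pos_logit=0.0, neg_logit=0.0):
    """Positive gradient magnitude divided by the summed negative gradient magnitude.

    Assumption: sigmoid BCE, so d(loss)/d(logit) = p - y for every category,
    with one positive category and (num_categories - 1) negatives.
    """
    pos_grad = abs(sigmoid(pos_logit) - 1.0)
    neg_grad = (num_categories - 1) * abs(sigmoid(neg_logit) - 0.0)
    return pos_grad / neg_grad


# Full vocabularies: the positive signal shrinks as the category count grows.
for vocab_size in (80, 1000, 13204):
    print(vocab_size, pos_neg_gradient_ratio(vocab_size))

# Restricting the negatives to a selected subset (hypothetical K = 100)
# brings the ratio back to the small-vocabulary regime.
print("top-K", pos_neg_gradient_ratio(100))
```

Under these assumptions the ratio at initialization is 1 / (C - 1) for C categories, so shrinking the candidate set from the full vocabulary to K selected categories raises the relative weight of the positive gradient by roughly C / K.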

Abstract

With the exponential growth of data, traditional object detection methods increasingly struggle to handle vast vocabulary object detection tasks effectively. We analyze two key limitations of classification-based detectors: positive gradient dilution, where rare positive categories receive insufficient learning signals, and hard negative gradient dilution, where discriminative gradients are overwhelmed by numerous easy negatives. To address these challenges, we propose CQ-DINO, a category query-based object detection framework that reformulates classification as a contrastive task between object queries and learnable category queries. Our method introduces image-guided query selection, which reduces the negative space by adaptively retrieving the top-K relevant categories per image via cross-attention, thereby rebalancing gradient distributions and facilitating implicit hard example mining. Furthermore, CQ-DINO flexibly integrates explicit hierarchical category relationships in structured datasets (e.g., V3Det) or learns implicit category correlations via self-attention in generic datasets (e.g., COCO). Experiments demonstrate that CQ-DINO achieves superior performance on the challenging V3Det benchmark (surpassing previous methods by 2.1% AP) while remaining competitive on COCO. Our work provides a scalable solution for real-world detection systems requiring wide category coverage. The code is publicly available at https://github.com/FireRedTeam/CQ-DINO.
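
The two steps described in the abstract, selecting the top-K category queries for an image and then scoring object queries only against that subset, can be sketched as follows. This is a simplified illustration, not the released implementation: it replaces the cross-attention selection with a plain dot product against a single pooled image feature, and all function names, vectors, and the choice K = 2 are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_top_k_categories(image_feature, category_queries, k):
    """Return indices of the k category queries most similar to the image feature.

    Simplification: similarity is a dot product with one pooled feature,
    standing in for the cross-attention scoring described in the abstract.
    """
    scores = [dot(image_feature, q) for q in category_queries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def contrastive_logits(object_queries, category_queries, selected):
    """Similarity logits between each object query and the selected categories only."""
    return [[dot(o, category_queries[i]) for i in selected] for o in object_queries]


# Toy vocabulary of five learnable category queries (hypothetical values).
category_queries = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [-1, 0, 0],
]
image_feature = [0.9, 0.1, 0.0]
object_queries = [[1.0, 0.5, 0.0]]

selected = select_top_k_categories(image_feature, category_queries, k=2)
logits = contrastive_logits(object_queries, category_queries, selected)
print(selected, logits)
```

Because the logits are computed only over the selected indices, each object query is contrasted against K categories rather than the full vocabulary, and the retained negatives are those the image already scores highly, which is the sense in which selection acts as implicit hard negative mining.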

Paper Structure

This paper contains 40 sections, 9 equations, 8 figures, 13 tables, 3 algorithms.

Figures (8)

  • Figure 1: Comparison of category prediction mechanisms for vast vocabulary object detection. (a) Classification head-based detectors with fixed FFN layers face severe optimization challenges with increasing vocabulary size. (b) Text-prompted contrastive detectors leverage VLMs but require multiple inference passes for vast category lists. (c) Language model generated detectors enable open-ended detection but lack control over category granularity. (d) Our proposed CQ-DINO encodes categories as learnable category queries and leverages query selection to identify the most relevant categories in the image, achieving both scalability and improved performance.
  • Figure 2: Positive-to-negative gradient ratio comparing CQ-DINO against DINO with Focal Loss (FL) and Cross-Entropy Loss (CE) on V3Det and COCO datasets, showing the initial 2k training iterations where differences are most evident.
  • Figure 3: Overview of the CQ-DINO framework for vast vocabulary object detection. Key components: (1) Learnable category queries enhanced with hierarchical tree construction for semantic relationship modeling; (2) Image-guided query selection that identifies the most relevant category queries; (3) Feature enhancer and cross-modality decoder (adapted from GroundingDINO), processing object queries with contrastive alignment between object and selected category queries.
  • Figure 4: Illustration of image-guided query selection module.
  • Figure 5: Hierarchical tree construction for category queries.
  • ...and 3 more figures