State and Scene Enhanced Prototypes for Weakly Supervised Open-Vocabulary Object Detection

Jiaying Zhou, Qingchao Chen

TL;DR

This work tackles Weakly Supervised Open-Vocabulary Object Detection (WS-OVOD) by addressing two bottlenecks: static semantic prototypes that miss intra-class state variation, and a semantic mismatch between context-rich visual proposals and object-centric text embeddings. It introduces State-Enhanced Semantic Prototypes (SESP), which capture diverse object states by combining LLM-generated state descriptors with a generic description, and Scene-Augmented Pseudo Prototypes (SAPP), which embed contextual scene information and are softly aligned with the proposals drawn from weakly supervised (image-level) data. The overall objective combines standard detection losses with a scene-alignment term, enabling end-to-end training on both detection and classification data. Results on OV-COCO and OV-LVIS show gains, especially for novel categories, and cross-dataset transfer to Objects365 indicates that the approach generalizes, supporting the value of richer, context-aware language-vision prototypes for open-vocabulary detection.

Abstract

Open-Vocabulary Object Detection (OVOD) aims to generalize object recognition to novel categories, while Weakly Supervised OVOD (WS-OVOD) extends this by combining box-level annotations with image-level labels. Despite recent progress, two critical challenges persist in this setting. First, existing semantic prototypes, even when enriched by LLMs, are static and limited, failing to capture the rich intra-class visual variations induced by different object states (e.g., a cat's pose). Second, standard pseudo-box generation introduces a semantic mismatch between visual region proposals (which contain context) and object-centric text embeddings. To tackle these issues, we introduce two complementary prototype enhancement strategies. To capture intra-class variations in appearance and state, we propose State-Enhanced Semantic Prototypes (SESP), which use state-aware textual descriptions (e.g., "a sleeping cat") to represent diverse object appearances, yielding more discriminative prototypes. Building on this, we further introduce Scene-Augmented Pseudo Prototypes (SAPP) to address the semantic mismatch. SAPP incorporates contextual semantics (e.g., "cat lying on sofa") and uses a soft alignment mechanism to promote contextually consistent visual-textual representations. By integrating SESP and SAPP, our method enhances both the richness of semantic prototypes and the visual-textual alignment, achieving notable improvements.
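
To make the SESP idea concrete, the following is a minimal sketch of how several description embeddings could be merged into one class prototype and then used to score a region feature. It is an illustration under stated assumptions, not the authors' implementation: the aggregation operator (a mean of L2-normalized embeddings), the helper names (`aggregate_prototype`, `classify`), and the 4-dimensional toy vectors standing in for text-encoder and RoI outputs are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def aggregate_prototype(description_embeddings):
    """Assumed aggregation: mean of unit-length embeddings, renormalized."""
    normed = [l2_normalize(e) for e in description_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def classify(region_feature, prototypes):
    """Return the class whose prototype is most similar to the region."""
    return max(prototypes, key=lambda c: cosine(region_feature, prototypes[c]))

# Toy stand-ins for text embeddings (hypothetical values).
# Axes: [cat-ness, dog-ness, "sleeping" cue, "sitting" cue]
descriptions = {
    "cat": [
        [1.0, 0.0, 0.0, 0.0],  # generic: "a cat"
        [0.8, 0.0, 0.6, 0.0],  # state:   "a sleeping cat"
        [0.8, 0.0, 0.0, 0.6],  # state:   "a sitting cat"
    ],
    "dog": [
        [0.0, 1.0, 0.0, 0.0],  # generic: "a dog"
        [0.0, 0.8, 0.6, 0.0],  # state:   "a sleeping dog"
        [0.0, 0.8, 0.0, 0.6],  # state:   "a sitting dog"
    ],
}

prototypes = {c: aggregate_prototype(embs) for c, embs in descriptions.items()}

# A region feature that resembles a sleeping cat (hypothetical values).
region = [0.7, 0.1, 0.7, 0.0]

predicted = classify(region, prototypes)
sesp_score = cosine(region, prototypes["cat"])
name_only_score = cosine(region, descriptions["cat"][0])
print(predicted, sesp_score > name_only_score)
```

In this toy setup the state-enhanced prototype scores the sleeping-cat region higher than the class-name-only embedding does, which is the intuition behind using state descriptions; real embeddings and the actual aggregation rule may behave differently.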

Paper Structure

This paper contains 17 sections, 10 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Semantic Prototype Augmentation for Weakly Supervised Open-Vocabulary Object Detection. (a) State-Enhanced Semantic Prototypes (SESP). Standard category names (e.g., "cat") inadequately capture intra-class visual diversity across various object states (e.g., "sleeping cat," "sitting cat"). SESP addresses this by augmenting prototypes with state-aware textual descriptions for more discriminative representations. (b) Scene-Augmented Pseudo Prototypes (SAPP). Standard pseudo-boxes (max-size proposals used in image classification data) often include substantial contextual information, leading to misalignment with class-centric prototypes. SAPP resolves this by incorporating contextual semantics to generate pseudo prototypes and using a soft alignment mechanism to bridge the visual-textual gap.
  • Figure 2: Overview of our proposed method. (a) For detection data $D_{\text{det}}$, RoI features are aligned to state-enhanced semantic prototypes $\textit{p}_c$. For classification data $D_{\text{cls}}$, only the max-size proposal is retained and its feature is aligned with $\textit{p}_c$ for weakly supervised classification and softly aligned with scene-augmented pseudo prototypes $\textit{w}_{\text{scene},c}$ to capture contextual semantics. (b) State-enhanced semantic prototypes are generated by prompting a large language model (LLM) with state-specific templates, followed by aggregation across diverse state descriptions. (c) Scene-augmented pseudo prototypes are derived by prompting the LLM with context-aware templates, capturing object-scene interactions to enrich representation. Note that the RPN loss $\mathcal{L}_{\text{rpn}}$ and the bounding box regression loss $\mathcal{L}_{\text{reg}}$, which constitute the standard detection losses trained only on $D_{\text{det}}$, are omitted for clarity.
  • Figure 3: Qualitative comparison of detection results using standard class-name (CName) prototypes (top row) versus our state-enhanced semantic prototypes (bottom row). The detector is trained on the OV-COCO dataset. At inference time, only the "cat" category is treated as the target class, and predictions are filtered with a confidence threshold of 0.3.
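
Two mechanical steps mentioned in the captions above can be sketched in a few lines: keeping only the max-size proposal as the pseudo-box for an image-level-labeled image (Figure 2), and filtering predictions with a confidence threshold of 0.3 (Figure 3). This is a hedged illustration rather than the paper's code. The `(x1, y1, x2, y2)` box format, the dictionary layout of a detection, the function names, and the choice of an inclusive comparison (`>=`) at the threshold are assumptions.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def max_size_proposal(proposals):
    """Keep the single largest proposal as the pseudo-box."""
    return max(proposals, key=box_area)

def filter_by_confidence(detections, threshold=0.3):
    """Keep detections scoring at or above the threshold (inclusive by assumption)."""
    return [d for d in detections if d["score"] >= threshold]

# Hypothetical region proposals for one classification image.
proposals = [(10, 10, 50, 60), (0, 0, 200, 150), (30, 40, 120, 100)]
pseudo_box = max_size_proposal(proposals)

# Hypothetical "cat" predictions at inference time.
detections = [
    {"label": "cat", "score": 0.82, "box": (12, 20, 90, 110)},
    {"label": "cat", "score": 0.27, "box": (100, 40, 160, 95)},
    {"label": "cat", "score": 0.30, "box": (5, 5, 40, 40)},
]
kept = filter_by_confidence(detections)

print(pseudo_box)
print([d["score"] for d in kept])
```

Because the largest proposal tends to cover much of the image, it usually carries scene context along with the object, which is the mismatch that SAPP is described as addressing.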