Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method

Xiaoran Xu, Xiaoshan Yang, Jiangang Yang, Yifan Xu, Jian Liu, Changsheng Xu

Abstract

Open-Vocabulary Object Detection (OVOD) has achieved remarkable success in generalizing to novel categories. However, this success often rests on the implicit assumption of domain stationarity. In this work, we provide a principled revisit of the OVOD paradigm, uncovering a fundamental vulnerability: the fragile coupling between visual manifolds and textual embeddings when distribution shifts occur. We first systematically formalize Domain-Generalized Open-Vocabulary Object Detection (DG-OVOD). Through empirical analysis, we demonstrate that visual shifts do not merely add noise; they cause a collapse of the latent cross-modal space in which the visual signals of novel categories detach from their semantic anchors. Motivated by these insights, we propose Progressive Domain-invariant Cross-modal Alignment (PICA). PICA departs from uniform training by introducing a multi-level curriculum over ambiguity and signal strength. It builds adaptive pseudo-word prototypes, refined via sample confidence and visual consistency, to enforce invariant cross-domain modality alignment. Our findings suggest that OVOD's robustness to domain shifts is intrinsically linked to the stability of the latent cross-modal alignment space. Our work provides both a challenging benchmark and a new perspective on building truly generalizable open-vocabulary systems that extend beyond static laboratory conditions.

Paper Structure

This paper contains 22 sections, 11 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Cross-modal alignment analysis. (a) PICA preserves more stable and object-centric attention patterns across both base and novel categories compared to BARON (wu2023aligning). (b) The difficulty of preserving cross-modal alignment under domain shifts varies across samples.
  • Figure 2: Overview of the Progressive Domain-invariant Cross-modal Alignment (PICA). The progressive sampler ranks visual and textual region features by ambiguity proxy $h$ and signal strength proxy $q$, dividing them into tiers. It then uses a dynamic sampling ratio $\alpha(\rho)$, where $\rho$ represents the training iteration, to align these features with pseudo-word prototypes in a staged curriculum to enhance cross-modal consistency.
  • Figure 3: Comparative analysis of cross-modal alignment and training stability. (a) Mean AI-gap across OOD domains. (b) Gradient cosine similarity during training.
  • Figure 4: Region-level cross-modal alignment stability between standard and Gaussian noise. Each point represents a region, plotted by its clean-image signal strength $q$ (x-axis) and ambiguity $h$ (y-axis), colored by confusion increase $\Delta h = h_{\text{corrupted}} - h_{\text{clean}}$, demonstrating that curriculum training selectively consolidates alignment robustness for reliable regions.
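The progressive sampler described in Figure 2 can be illustrated with a minimal sketch. Here the ambiguity proxy $h$ is assumed to be the entropy of each region's softmax similarity to the text embeddings, the signal strength proxy $q$ is given externally, and `alpha_min`/`alpha_max` are hypothetical bounds for the dynamic ratio $\alpha(\rho)$; the paper's exact definitions and tiering rule may differ.

```python
import numpy as np

def ambiguity(sim):
    """Ambiguity proxy h: entropy of softmax over region-text similarities."""
    p = np.exp(sim - sim.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def sampling_ratio(rho, total_iters, alpha_min=0.3, alpha_max=1.0):
    """Dynamic ratio alpha(rho): widen the sampled pool as training progresses.

    alpha_min/alpha_max and the linear schedule are illustrative assumptions.
    """
    t = min(rho / total_iters, 1.0)
    return alpha_min + (alpha_max - alpha_min) * t

def progressive_sample(sim, q, rho, total_iters, n_tiers=3):
    """Rank regions from easy (low ambiguity h, high signal q) to hard,
    split them into tiers, and keep the easiest alpha(rho) fraction
    for the current alignment step."""
    h = ambiguity(sim)
    # lexsort uses the LAST key as primary: sort by h ascending,
    # break ties by q descending (via -q)
    order = np.lexsort((-q, h))
    tiers = np.array_split(order, n_tiers)
    k = max(1, int(round(sampling_ratio(rho, total_iters) * len(order))))
    return order[:k], tiers
```

Early in training (small $\rho$) only the lowest-ambiguity, highest-signal regions participate in the alignment loss; the pool grows toward the full set as $\rho$ approaches the total iteration budget, matching the staged-curriculum intuition of the figure.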