Hyperbolic Learning with Synthetic Captions for Open-World Detection

Fanjie Kong, Yanbei Chen, Jiarui Cai, Davide Modolo

TL;DR

This work tackles open-world detection by leveraging synthetic captions generated from a strong vision-language model to enrich region-level descriptions. It introduces HyperLearner, a detector that learns visual-text representations in a hyperbolic space to impose a 'caption entails object' hierarchy, mitigating hallucinations in synthetic captions. Across COCO, LVIS, ODinW, and RefCOCO-family benchmarks, HyperLearner achieves state-of-the-art performance with efficient backbones and demonstrates robust open-world generalization. The combination of dense caption bootstrapping, cross-modal alignment, and hyperbolic learning yields strong empirical gains and provides a foundation for extending open-world understanding with synthetic, scalable supervision.

Abstract

Open-world detection poses significant challenges, as it requires detecting any object from either object class labels or free-form text. Existing works often train on large-scale, manually annotated caption datasets, which are extremely expensive to collect. Instead, we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically. Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions of different regions in images, and incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions, we also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings. We call our detector "HyperLearner". We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, Object Detection in the Wild, RefCOCO), and our results show that our model consistently outperforms existing state-of-the-art methods, such as GLIP, GLIPv2 and Grounding DINO, when using the same backbone.

Paper Structure (18 sections, 16 equations, 7 figures, 10 tables)

This paper contains 18 sections, 16 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: We tackle the task of detecting seen and unseen objects using keywords (e.g., panda) or free-form text (e.g., a black-and-white giant bear) in the open world. We exploit synthetic captions from pre-trained caption models to bring rich open-world knowledge into training. As synthetic captions may be noisy, we propose to align visual features with text embeddings in a structural hierarchy to learn robustly and effectively from these captions.
  • Figure 2: Approach overview. Given an image, our open-world detector extracts visual feature embeddings on region crops, and aligns them with text embeddings extracted from a pre-trained text encoder (§ \ref{sec:prelim}). To obtain synthetic captions for both seen and novel objects, we propose region sampling to augment region crops, and adopt a pre-trained image captioner to bootstrap synthetic captions on these region crops (§ \ref{sec:syn_caps}). To learn from synthetic captions effectively, we propose to align visual feature embeddings and caption embeddings in a structural hierarchy through vision-language learning in hyperbolic space (see Figure \ref{fig:loss}, § \ref{sec:hyp_learn}).
  • Figure 3: Illustration of hyperbolic vision-language learning. The visual and caption embeddings are lifted from (a) Euclidean space to (b) hyperbolic space by exponential mapping (Eq. \ref{eq:hyp_exp}). To learn the partial order 'caption entails object', we propose a hyperbolic contrastive loss and a hyperbolic entailment loss (Eq. \ref{eq:hyp_contrastive}, Eq. \ref{eq:entail}) that align visual and caption embeddings in a hierarchy, so that hallucinated content in a caption is not aligned directly with the visual embeddings, where it would degrade model learning.
  • Figure 4: Comparison between the baseline objective (Eq. \ref{eq:baseline}) and our objective (Eq. \ref{eq:ours}) on COCO during training. Metric: mAP.
  • Figure 5: Qualitative results on open-world detection. Given new class labels (row 1) or free-form text (row 2) that specifies objects by attribute, action, interaction, or spatial relationship, our model can detect and localize the objects in images.
  • ...and 2 more figures
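
The Figure 3 caption describes lifting Euclidean embeddings into hyperbolic space by exponential mapping. The following is a minimal sketch of that step only, not the authors' implementation. The caption does not say which hyperbolic model is used, so the sketch assumes the Lorentz (hyperboloid) model with curvature -c and an exponential map taken at the hyperboloid origin. The function names `exp_map_origin`, `lorentz_inner`, and `lorentz_distance` are hypothetical, and the contrastive and entailment losses are not reproduced here.

```python
import math


def exp_map_origin(v, c=1.0):
    """Lift a Euclidean vector v onto the hyperboloid of curvature -c.

    Assumption: v is treated as a tangent vector at the hyperboloid origin.
    Returns (space_components, time_component).
    """
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-12:
        scale = 1.0  # sinh(t) / t tends to 1 as t tends to 0
    else:
        scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)
    space = [scale * x for x in v]
    # The time component is fixed by the constraint <x, x>_L = -1/c.
    time = math.sqrt(1.0 / c + sum(x * x for x in space))
    return space, time


def lorentz_inner(p, q):
    """Lorentzian inner product: space dot product minus time product."""
    (p_space, p_time), (q_space, q_time) = p, q
    return sum(a * b for a, b in zip(p_space, q_space)) - p_time * q_time


def lorentz_distance(p, q, c=1.0):
    """Geodesic distance between two points on the hyperboloid."""
    # Clamp to the domain of acosh to absorb floating-point round-off.
    arg = max(1.0, -c * lorentz_inner(p, q))
    return math.acosh(arg) / math.sqrt(c)


if __name__ == "__main__":
    origin = exp_map_origin([0.0, 0.0])
    point = exp_map_origin([0.3, -0.4])
    print("on-manifold check:", lorentz_inner(point, point))
    print("distance from origin:", lorentz_distance(origin, point))
```

Under these assumptions, a lifted point satisfies the hyperboloid constraint, and its geodesic distance from the origin equals the Euclidean norm of the input vector. Distance from the origin therefore gives a natural axis along which a hierarchy such as 'caption entails object' could be ordered.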