Cross-View Open-Vocabulary Object Detection in Aerial Imagery

Jyoti Kini, Rohit Gupta, Mubarak Shah

TL;DR

This work tackles open-vocabulary object detection in aerial imagery by adapting Vision-Language Models pretrained on ground-view data to the aerial domain. It introduces a cross-view framework with two contrastive objectives: (1) Cross-View Representation Alignment between aerial and ground-view embeddings, $\mathcal{L}_{\textit{ImageA--ImageG}}$, and (2) Aerial-Text Multi-Instance Association, $\mathcal{L}_{\textit{ImageA--Text}}$. Two data generation pipelines support these objectives: one builds the cross-view correspondence dataset $D_{aligned}$, and the other generates vocabulary variants of class names that form the text bags. Evaluated on five remote-sensing datasets (xView, DOTAv2, VisDrone, DIOR, and HRRSD), the zero-shot model improves over finetuned closed-vocabulary, dataset-specific baselines by +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone (Images), and +3.46 mAP on HRRSD, and it also outperforms existing open-vocabulary detectors. Because the alignment losses and data pipelines are modular, the approach is model-agnostic and transfers across detectors. Overall, the method enables scalable, flexible open-vocabulary aerial detection without dataset-specific finetuning, with broad implications for real-world aerial surveillance and mapping tasks.

Abstract

Traditional object detection models are typically trained on a fixed set of classes, limiting their flexibility and making it costly to incorporate new categories. Open-vocabulary object detection addresses this limitation by enabling models to identify unseen classes without explicit training. Leveraging pretrained models contrastively trained on abundantly available ground-view image-text classification pairs provides a strong foundation for open-vocabulary object detection in aerial imagery. However, domain shifts, viewpoint variations, and extreme scale differences make direct knowledge transfer across domains ineffective, so specialized adaptation strategies are required. In this paper, we propose a novel framework for adapting open-vocabulary representations from ground-view images to solve object detection in aerial imagery through structured domain alignment. The method introduces contrastive image-to-image alignment to enhance the similarity between aerial and ground-view embeddings and employs multi-instance vocabulary associations to align aerial images with text embeddings. Extensive experiments on the xView, DOTAv2, VisDrone, DIOR, and HRRSD datasets validate our approach. Our open-vocabulary model achieves improvements of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone (Images), and +3.46 mAP on HRRSD in the zero-shot setting when compared to finetuned closed-vocabulary dataset-specific model performance, thus paving the way for more flexible and scalable object detection systems in aerial applications.
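
The two alignment objectives named above can be illustrated with a small sketch. The paper's exact loss definitions are not reproduced in this section, so the code below assumes a standard InfoNCE form for the aerial-to-ground term and a multi-instance (MIL-NCE-style) form for the aerial-to-text-bag term. The function names `cross_view_loss` and `text_bag_loss`, the temperature value of 0.07, and the toy embeddings are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def log_sum_exp(values):
    # Numerically stable log(sum(exp(x))) over a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(x - m) for x in values))

def cross_view_loss(aerial, ground, temperature=0.07):
    """Assumed InfoNCE form: aerial[i] and ground[i] are a positive pair,
    and every other ground embedding in the batch acts as a negative."""
    aerial = [normalize(a) for a in aerial]
    ground = [normalize(g) for g in ground]
    total = 0.0
    for i, a in enumerate(aerial):
        logits = [dot(a, g) / temperature for g in ground]
        total += log_sum_exp(logits) - logits[i]
    return total / len(aerial)

def text_bag_loss(aerial, text_bags, temperature=0.07):
    """Assumed multi-instance form: every text variant in bag i counts as a
    positive for aerial[i]; variants in all other bags are negatives."""
    aerial = [normalize(a) for a in aerial]
    bags = [[normalize(t) for t in bag] for bag in text_bags]
    total = 0.0
    for i, a in enumerate(aerial):
        positives = [dot(a, t) / temperature for t in bags[i]]
        everything = [dot(a, t) / temperature for bag in bags for t in bag]
        total += log_sum_exp(everything) - log_sum_exp(positives)
    return total / len(aerial)

# Toy 2-D embeddings: two aerial objects, their ground-view matches,
# and a bag of two text variants per class.
aerial = [[1.0, 0.0], [0.0, 1.0]]
ground_matched = [[0.9, 0.1], [0.1, 0.9]]
ground_swapped = [[0.1, 0.9], [0.9, 0.1]]
bags_matched = [[[1.0, 0.1], [0.9, 0.0]], [[0.0, 1.0], [0.2, 1.0]]]
bags_swapped = [bags_matched[1], bags_matched[0]]

print(cross_view_loss(aerial, ground_matched) < cross_view_loss(aerial, ground_swapped))
print(text_bag_loss(aerial, bags_matched) < text_bag_loss(aerial, bags_swapped))
```

Both terms fall when an aerial embedding sits close to its own ground-view or text positives and far from the others. In the multi-instance term the positives are summed inside the logarithm, so a single well-matched variant in the bag is enough to lower the loss.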

Paper Structure

This paper contains 18 sections, 2 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Motivation: Ground-view contrastive pretrained open vocabulary detectors (top-left) fail to generalize to aerial views, and traditional finetuning (top-right) results in misaligned feature spaces. Our method (bottom-left) enforces cross-view contrastive alignment and aerial view-text associations, ensuring better semantic consistency and generalization for open-vocabulary aerial object detection (bottom-right). Note: Different object classes are color-coded: Airplane, Car, Truck. Different domains are shape-coded: $\blacktriangle$ Ground-view, $\blacksquare$ Text, $\bigstar$ Aerial-view.
  • Figure 2: Our approach improves aerial-view detection while preserving ground-view performance and avoids catastrophic forgetting typical of naive finetuning.
  • Figure 3: Overview: Traditional open-vocabulary object detector finetuning includes two losses: a bipartite-matching-based bounding box regression loss and a classification loss computed in the shared image–text embedding space. We further introduce two new components: a ground–aerial contrastive loss that aligns aerial and ground image embeddings, and a text-bag multi-instance contrastive loss that aligns aerial features with text bags. To realize these objectives, we generate an aerial–ground correspondence dataset providing cross-view positives/negatives, and also expand the class vocabulary by using ChatGPT to synthesize variants of class names (“text bags”) consumed by the text encoder. Joint optimization over these losses places aerial $\leftrightarrow$ ground $\leftrightarrow$ text representations in a cohesive shared embedding space, yielding stronger zero-shot generalization in aerial imagery.
  • Figure 4: Aerial-Ground Object Detection Correspondence: Step 1 (top-left) shows a case where the category 'Car' is annotated in both aerial and ground-view datasets, allowing direct positive pair generation using ground-truth annotations. Step 2 (top-right) illustrates a case where the category 'Building' lacks ground-view annotations. To address this, OWLv2 is used to generate detections in the ground-view image (bottom row), followed by non-maximum suppression to establish cross-view correspondence data for our contrastive alignment training.
  • Figure 5: Aerial-Ground Object Detection Correspondence
  • ...and 7 more figures
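
The Figure 4 caption mentions that OWLv2 detections on ground-view images are filtered with non-maximum suppression before being used as cross-view correspondences. The sketch below shows plain greedy NMS as one way that filtering step could look. The box format `(x1, y1, x2, y2)`, the IoU threshold of 0.5, and the function names are assumptions for illustration; the paper's actual settings are not stated in this section.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it does not overlap an already kept box above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Hypothetical detections for one ground-view image: the first two boxes
# overlap heavily, the third is far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The surviving boxes would then serve as the ground-view side of a positive pair for classes, such as 'Building', that have no ground-truth ground-view annotations.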