Cross-View Open-Vocabulary Object Detection in Aerial Imagery

Jyoti Kini; Rohit Gupta; Mubarak Shah

TL;DR

This work tackles open-vocabulary object detection in aerial imagery by adapting Vision-Language Models pretrained on ground-view data to the aerial domain. It introduces a cross-view framework with two contrastive objectives: (1) Cross-View Representation Alignment between aerial and ground-view embeddings, $\mathcal{L}_{\textit{ImageA--ImageG}}$, and (2) Aerial-Text Multi-Instance Association, $\mathcal{L}_{\textit{ImageA--Text}}$. Supporting data generation pipelines create the cross-view correspondences $D_{aligned}$ and the vocabulary variants that make up the text bags. Evaluated zero-shot on five remote-sensing datasets (xView, DOTAv2, VisDrone, DIOR, and HRRSD), the approach outperforms existing OV detectors and, on DOTAv2, VisDrone (Images), and HRRSD, exceeds finetuned closed-vocabulary dataset-specific models by +6.32, +4.16, and +3.46 mAP respectively. The alignment losses and data pipelines are modular, which makes the method model-agnostic and transferable. Overall, it enables scalable open-vocabulary aerial detection without dataset-specific finetuning, with implications for real-world aerial surveillance and mapping tasks.

Abstract

Traditional object detection models are typically trained on a fixed set of classes, limiting their flexibility and making it costly to incorporate new categories. Open-vocabulary object detection addresses this limitation by enabling models to identify unseen classes without explicit training. Leveraging pretrained models contrastively trained on abundantly available ground-view image-text classification pairs provides a strong foundation for open-vocabulary object detection in aerial imagery. Domain shifts, viewpoint variations, and extreme scale differences make direct knowledge transfer across domains ineffective, requiring specialized adaptation strategies. In this paper, we propose a novel framework for adapting open-vocabulary representations from ground-view images to solve object detection in aerial imagery through structured domain alignment. The method introduces contrastive image-to-image alignment to enhance the similarity between aerial and ground-view embeddings and employs multi-instance vocabulary associations to align aerial images with text embeddings. Extensive experiments on the xView, DOTAv2, VisDrone, DIOR, and HRRSD datasets are used to validate our approach. Our open-vocabulary model achieves improvements of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone (Images), and +3.46 mAP on HRRSD in the zero-shot setting when compared to finetuned closed-vocabulary dataset-specific model performance, thus paving the way for more flexible and scalable object detection systems in aerial applications.
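
To make the two alignment objectives named above more concrete, the following is a minimal sketch, not the paper's exact formulation. It assumes a standard InfoNCE loss for the aerial-to-ground image alignment and a MIL-NCE-style loss for the aerial-to-text bags, where every text variant in an image's bag counts as a positive. The function names, the `bag_ids` layout, and the temperature of 0.07 are all illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(queries, keys, temperature):
    """Temperature-scaled cosine similarity between every query and every key."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    return [[sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k] for qi in q]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def image_image_loss(aerial, ground, temperature=0.07):
    """InfoNCE (assumed form): aerial[i] and ground[i] are a positive pair,
    and every other ground embedding in the batch acts as a negative."""
    logits = cosine_logits(aerial, ground, temperature)
    losses = [log_sum_exp(row) - row[i] for i, row in enumerate(logits)]
    return sum(losses) / len(losses)


def image_text_bag_loss(aerial, text, bag_ids, temperature=0.07):
    """MIL-NCE-style loss (assumed form): text[j] is a positive for aerial[i]
    whenever bag_ids[j] == i, so one image can match several vocabulary variants.
    Every aerial sample must own at least one text embedding."""
    logits = cosine_logits(aerial, text, temperature)
    losses = []
    for i, row in enumerate(logits):
        positives = [row[j] for j, b in enumerate(bag_ids) if b == i]
        losses.append(log_sum_exp(row) - log_sum_exp(positives))
    return sum(losses) / len(losses)


# Toy 2-D embeddings: two aerial crops, their ground-view counterparts,
# and a bag of two text variants per crop (hypothetical data).
aerial = [[1.0, 0.0], [0.0, 1.0]]
ground = [[0.9, 0.1], [0.1, 0.9]]
text = [[1.0, 0.1], [0.8, 0.0], [0.0, 1.0], [0.1, 0.7]]
bag_ids = [0, 0, 1, 1]

loss_img = image_image_loss(aerial, ground)
loss_txt = image_text_bag_loss(aerial, text, bag_ids)
```

In this sketch the two terms would simply be summed (possibly with weights) and added to the detector's training loss. How the terms are actually weighted and combined is not stated in the text above, so that part is left open.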
