Generalized Referring Expression Segmentation on Aerial Photos
Luís Marnoto, Alexandre Bernardino, Bruno Martins
TL;DR
The paper tackles open-vocabulary referring expression segmentation in aerial imagery by introducing Aerial-D, a large-scale dataset built from rule-based expressions that are refined with LLMs and augmented with simulated historical-image degradations. It trains a unified RSRefSeg model across modern and archival imagery using mixed datasets and a LoRA-tuned SigLIP/SAM backbone, achieving competitive results and robustness to degradation. Extensive ablations show how expression generation sources and historic augmentations affect cross-dataset generalization, and a distillation strategy makes large-scale LLM annotation cost-effective. The work offers reproducible baselines and points to future directions, such as multilingual expressions and stronger backbones, to push open-vocabulary aerial scene understanding further.
Abstract
Referring expression segmentation is a fundamental task in computer vision that integrates natural language understanding with precise visual localization of target regions. Aerial imagery (e.g., modern aerial photos collected by drones, historical photos from aerial archives, and high-resolution satellite imagery) presents unique challenges: spatial resolution varies widely across datasets, color is used inconsistently, targets often shrink to only a few pixels, and scenes contain very high object densities and partially occluded objects. This work presents Aerial-D, a new large-scale referring expression segmentation dataset for aerial imagery, comprising 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets. The targets span individual object instances, groups of instances, and semantic regions, covering 21 distinct classes that range from vehicles and infrastructure to land cover types. The dataset was constructed through a fully automatic pipeline that combines systematic rule-based expression generation with a Large Language Model (LLM) enhancement procedure that enriched both the linguistic variety and the focus on visual details within the referring expressions. Filters were additionally used to simulate historic imaging conditions for each scene. We adopted the RSRefSeg architecture and trained models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images. Results show that the combined training achieves competitive performance on contemporary benchmarks while maintaining strong accuracy under the monochrome, sepia, and grainy degradations that appear in archival aerial photography. The dataset, trained models, and complete software pipeline are publicly available at https://luispl77.github.io/aerial-d.
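The abstract mentions filters that simulate historic imaging conditions, with monochrome, sepia, and grainy degradations named explicitly. The following is a minimal NumPy sketch of what such filters could look like. It is not the released pipeline: the function names, the BT.601 luminance weights, the sepia matrix, and the Gaussian grain model with its `sigma` default are all illustrative assumptions.

```python
import numpy as np

# Luminance weights (ITU-R BT.601). Assumed here; the paper's filter may differ.
LUMA = np.array([0.299, 0.587, 0.114])

# A commonly used sepia colour matrix (rows give output R, G, B). Illustrative only.
SEPIA = np.array([
    [0.393, 0.769, 0.189],
    [0.349, 0.686, 0.168],
    [0.272, 0.534, 0.131],
])


def _to_uint8(arr):
    """Round, clip to [0, 255], and cast back to uint8."""
    return arr.round().clip(0, 255).astype(np.uint8)


def to_monochrome(img):
    """Convert an RGB uint8 image of shape (H, W, 3) to 3-channel grayscale."""
    gray = img.astype(np.float32) @ LUMA          # shape (H, W)
    return _to_uint8(np.repeat(gray[..., None], 3, axis=-1))


def to_sepia(img):
    """Apply a sepia tone to an RGB uint8 image of shape (H, W, 3)."""
    toned = img.astype(np.float32) @ SEPIA.T      # per-pixel 3x3 colour mix
    return _to_uint8(toned)


def add_grain(img, sigma=12.0, seed=None):
    """Add zero-mean Gaussian noise, shared across channels, to mimic film grain."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=img.shape[:2])
    return _to_uint8(img.astype(np.float32) + noise[..., None])


if __name__ == "__main__":
    # Synthetic stand-in for an aerial tile; no real data is loaded here.
    tile = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    degraded = add_grain(to_sepia(tile), sigma=10.0, seed=1)
    print(degraded.shape, degraded.dtype)
```

Because these filters change only pixel values and not geometry, the segmentation masks and referring expressions of a scene can be reused unchanged for its degraded variants, although expressions that mention color may need separate handling.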
