Generalized Referring Expression Segmentation on Aerial Photos

Luís Marnoto, Alexandre Bernardino, Bruno Martins

TL;DR

The paper tackles open-vocabulary referring expression segmentation in aerial imagery by introducing Aerial-D, a large-scale dataset built from rule-based expressions refined with LLMs and augmented with historic degradations. It trains a unified RSRefSeg model across modern and archival imagery using mixed datasets and a LoRA-tuned SigLIP/SAM backbone, achieving competitive results and robustness to degradation. Extensive ablations demonstrate how expression generation sources and historic augmentations affect cross-dataset generalization, and a distillation strategy makes large-scale LLM annotation cost-effective. The work offers reproducible baselines and points to future directions like multilingual expressions and stronger backbones to push open-vocabulary aerial scene understanding further.

Abstract

Referring expression segmentation is a fundamental task in computer vision that integrates natural language understanding with precise visual localization of target regions. Aerial imagery (e.g., modern aerial photos collected through drones, historical photos from aerial archives, high-resolution satellite imagery, etc.) presents unique challenges: spatial resolution varies widely across datasets, the use of color is inconsistent, targets often shrink to only a few pixels, and scenes contain very high object densities and partially occluded objects. This work presents Aerial-D, a new large-scale referring expression segmentation dataset for aerial imagery, comprising 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets. The targets span individual object instances, groups of instances, and semantic regions, covering 21 distinct classes that range from vehicles and infrastructure to land cover types. The dataset was constructed through a fully automatic pipeline that combines systematic rule-based expression generation with a Large Language Model (LLM) enhancement procedure, which enriched both the linguistic variety of the referring expressions and their focus on visual details. Filters were additionally used to simulate historic imaging conditions for each scene. We adopted the RSRefSeg architecture and trained models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images. Results show that the combined training achieves competitive performance on contemporary benchmarks while maintaining strong accuracy under the monochrome, sepia, and grainy degradations that appear in archival aerial photography. The dataset, trained models, and complete software pipeline are publicly available at https://luispl77.github.io/aerial-d .
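To make the idea of rule-based expression generation more concrete, the following is a minimal sketch of how a spatial cue could be derived from a bounding box and combined with a class name and an optional visual attribute. It is not the Aerial-D pipeline: the 3x3 grid, the function names `grid_position` and `rule_based_expressions`, the `color` argument, and the sentence template are all assumptions chosen for illustration. The "top-right" wording simply echoes the example described in the Figure 3 caption.

```python
def grid_position(bbox, width, height):
    """Name the cell of a 3x3 grid that contains the bounding-box centre.

    bbox is (x_min, y_min, x_max, y_max) in pixels (hypothetical convention).
    """
    x_min, y_min, x_max, y_max = bbox
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    # Clamp to 2 so a centre lying exactly on the far edge stays in the grid.
    col = min(2, int(3 * cx / width))
    row = min(2, int(3 * cy / height))
    rows = ("top", "center", "bottom")
    cols = ("left", "center", "right")
    if row == 1 and col == 1:
        return "center"
    if row == 1:
        return cols[col]
    if col == 1:
        return rows[row]
    return f"{rows[row]}-{cols[col]}"


def rule_based_expressions(category, bbox, width, height, color=None):
    """Combine spatial and (optional) visual cues into template expressions."""
    position = grid_position(bbox, width, height)
    expressions = [f"the {category} in the {position} of the image"]
    if color is not None:
        expressions.append(f"the {color} {category} in the {position} of the image")
    return expressions


if __name__ == "__main__":
    for text in rule_based_expressions("plane", (800, 50, 900, 150), 1000, 1000, color="white"):
        print(text)
```

In a setup like this, template outputs would serve only as seeds; the abstract states that an LLM enhancement step is what adds linguistic variety and visual detail.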

Paper Structure

This paper contains 18 sections, 6 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Representative examples from the Aerial-D dataset, showing diverse types of referring expressions paired together with the corresponding aerial images and the ground truth segmentation masks.
  • Figure 2: Overview of the RSRefSeg architecture [chen2025rsrefseg], which couples a pre-trained vision-language encoder with a segmentation decoder via a learned bridge.
  • Figure 3: An example illustrating rule-based generation for a single instance. The highlighted plane in the top-right section demonstrates how the system assigns spatial, visual, and relational cues that will later be combined into referring expressions.
  • Figure 4: An example illustrating the complete expression generation and enhancement pipeline. Starting from the aerial image (left), the rule-based approach generates initial expressions, which the LLM then refines into language variations and visual variations that incorporate additional contextual details (right).
  • Figure 5: Illustration of the filters used to simulate historical images. The image shows the original RGB capture (far left), grayscale conversion, grayscale with grain, and sepia toning with sensor noise. Each variant preserves structure while introducing degradations representative of archival imagery.
  • ...and 6 more figures
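
The degradations named in the Figure 5 caption (grayscale, grayscale with grain, sepia with noise) can be sketched roughly as follows. This is an illustrative approximation only, not the authors' filters: the BT.601 luma weights, the widely used sepia matrix, the Gaussian noise model, the `sigma` value, and all function names are assumptions. Images are represented as nested lists of `(r, g, b)` tuples to keep the sketch dependency-free.

```python
import random


def gray_pixel(rgb):
    """Grayscale value of one pixel using the common BT.601 luma weights."""
    r, g, b = rgb
    y = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (y, y, y)


def sepia_pixel(rgb):
    """Sepia toning with a commonly used 3x3 colour matrix (assumed here)."""
    r, g, b = rgb
    toned = (
        0.393 * r + 0.769 * g + 0.189 * b,
        0.349 * r + 0.686 * g + 0.168 * b,
        0.272 * r + 0.534 * g + 0.131 * b,
    )
    return tuple(min(255, round(c)) for c in toned)


def apply_per_pixel(image, pixel_fn):
    """Apply a per-pixel function to an image stored as rows of RGB tuples."""
    return [[pixel_fn(p) for p in row] for row in image]


def add_grain(image, sigma=12.0, seed=0):
    """Add one Gaussian offset per pixel, shared by all three channels."""
    rng = random.Random(seed)
    noisy = []
    for row in image:
        new_row = []
        for pixel in row:
            offset = rng.gauss(0.0, sigma)
            new_row.append(tuple(max(0, min(255, round(c + offset))) for c in pixel))
        noisy.append(new_row)
    return noisy


def historic_variants(image, sigma=12.0, seed=0):
    """Return the four variants listed in the Figure 5 caption."""
    gray = apply_per_pixel(image, gray_pixel)
    sepia = apply_per_pixel(image, sepia_pixel)
    return {
        "original": image,
        "grayscale": gray,
        "grayscale_grain": add_grain(gray, sigma, seed),
        "sepia_noise": add_grain(sepia, sigma, seed),
    }
```

Because every filter changes only pixel values and never pixel positions, the ground-truth masks of the original image remain valid for each variant, which matches the caption's note that each variant preserves structure.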