
AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions

Xing Zi, Tengjun Ni, Xianjing Fan, Xian Tao, Jun Li, Ali Braytee, Mukesh Prasad

TL;DR

AeroLite tackles the challenge of high-quality aerial image captioning under resource constraints by integrating explicit semantic tagging with a lightweight visual encoder and a small language model. It generates a large pseudo-caption corpus via GPT-4o, extracts domain-relevant semantic tags, and fuses them with CLIP-derived visual features through a compact bridging MLP. The model is trained in two stages with LoRA-based fine-tuning, enabling effective domain grounding on 1–3B parameter LMs and outperforming many larger models on UCM and Sydney datasets. This approach offers a practical, interpretable, and on-device-friendly solution for remote sensing captioning, with potential applicability to multi-task aerial intelligence systems.

Abstract

Accurate and automated captioning of aerial imagery is crucial for applications like environmental monitoring, urban planning, and disaster management. However, this task remains challenging due to complex spatial semantics and domain variability. To address these issues, we introduce AeroLite, a lightweight, tag-guided captioning framework designed to equip small-scale language models (1–3B parameters) with robust and interpretable captioning capabilities specifically for remote sensing images. AeroLite leverages GPT-4o to generate a large-scale, semantically rich pseudo-caption dataset by integrating multiple remote sensing benchmarks, including DLRSD, iSAID, LoveDA, WHU, and RSSCN7. To explicitly capture key semantic elements such as orientation and land-use types, AeroLite employs natural language processing techniques to extract relevant semantic tags. These tags are then learned by a dedicated multi-label CLIP encoder, ensuring precise semantic predictions. To effectively fuse visual and semantic information, we propose a novel bridging multilayer perceptron (MLP) architecture, aligning semantic tags with visual embeddings while maintaining minimal computational overhead. AeroLite's flexible design also enables seamless integration with various pretrained large language models. We adopt a two-stage LoRA-based training approach: the initial stage leverages our pseudo-caption dataset to capture broad remote sensing semantics, followed by fine-tuning on smaller, curated datasets like UCM and Sydney Captions to refine domain-specific alignment. Experimental evaluations demonstrate that AeroLite surpasses significantly larger models (e.g., 13B parameters) in standard captioning metrics, including BLEU and METEOR, while maintaining substantially lower computational costs.
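
As an illustration of the bridging idea, the sketch below maps one global image embedding to a short sequence of visual tokens sized for a language model. It is a minimal sketch under stated assumptions, not the authors' implementation: the two-layer layout, the GELU activation, the class name `BridgingMLP`, and all dimensions (`clip_dim`, `hidden_dim`, `lm_dim`, `num_tokens`) are hypothetical choices made only to keep the example small, and the weights are random rather than trained.

```python
import math
import random


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(vec, weight, bias):
    # weight holds one row per output unit; returns weight @ vec + bias.
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def init_linear(rng, n_in, n_out):
    # Small uniform initialisation; a real model would learn these values.
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


class BridgingMLP:
    """Hypothetical bridge: image embedding -> num_tokens vectors of size lm_dim."""

    def __init__(self, clip_dim, hidden_dim, lm_dim, num_tokens, seed=0):
        rng = random.Random(seed)
        self.lm_dim = lm_dim
        self.num_tokens = num_tokens
        self.w1, self.b1 = init_linear(rng, clip_dim, hidden_dim)
        self.w2, self.b2 = init_linear(rng, hidden_dim, num_tokens * lm_dim)

    def __call__(self, image_embedding):
        hidden = [gelu(h) for h in linear(image_embedding, self.w1, self.b1)]
        flat = linear(hidden, self.w2, self.b2)
        # Split the flat output into a sequence of token vectors.
        return [flat[i * self.lm_dim:(i + 1) * self.lm_dim] for i in range(self.num_tokens)]


# Toy sizes (assumed); real CLIP and LM dimensions are far larger.
bridge = BridgingMLP(clip_dim=8, hidden_dim=16, lm_dim=6, num_tokens=4)
image_embedding = [0.1 * i for i in range(8)]
visual_tokens = bridge(image_embedding)
print(len(visual_tokens), len(visual_tokens[0]))  # prints "4 6"
```

Because the bridge consists of only two small linear layers, it adds few parameters next to the frozen encoder and the language model, which is in line with the minimal-overhead goal stated in the abstract.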

Paper Structure

This paper contains 30 sections, 6 equations, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: Overall AeroLite pipeline for remote sensing captioning. A frozen CLIP encoder (left) extracts a global image embedding (in orange) and predicts multi-label tags (in green). The numeric embedding is passed through a bridging MLP to generate a sequence of visual tokens (shown in pink/white), while the predicted tags are incorporated into a text prompt to form an instruction for the small-scale language model (right). Through LoRA-based or prefix-only fine-tuning, the language model fuses visual tokens and tag-text tokens in a single self-attention context, enabling high-quality captions with minimal computational cost.
  • Figure 2: AeroLite Inference on Small-Scale LM: Example predictions on diverse aerial scenes, highlighting how explicit tags (in green) guide the language model to focus on specialized remote sensing semantics, such as “industrial zone” or “runway.”
  • Figure 3: Illustration of the prompt layout, remote sensing image, semantic segmentation labels, and the generated description.
  • Figure 4: Comparison of the original (pink) vs. filtered (blue) vocabulary distributions on a log scale. The filtering process discards rare and noisy terms, resulting in a more compact yet expressive vocabulary.
  • Figure 5: Bar-chart visualization of performance under partial unfreezing and LoRA fine-tuning across different models on the UCM and Sydney caption datasets. In each metric group (BLEU-4, METEOR, ROUGE-L), the bars are sorted in ascending order, and the best-performing model is highlighted by a dashed rectangle.
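
The tag-to-prompt step described in the Figure 1 caption can be sketched roughly as follows. This is a hedged illustration only: the function names `predict_tags` and `build_prompt`, the 0.5 threshold, the example vocabulary and logits, and the prompt wording are all assumptions, since the summary does not give the actual tag head or prompt template.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_tags(logits, vocabulary, threshold=0.5):
    """Multi-label selection: each tag is kept independently of the others
    if its sigmoid probability clears the (assumed) threshold."""
    scored = [(tag, sigmoid(z)) for tag, z in zip(vocabulary, logits)]
    kept = [(tag, p) for tag, p in scored if p >= threshold]
    # Most confident tags first, so they appear earliest in the prompt.
    kept.sort(key=lambda item: item[1], reverse=True)
    return [tag for tag, _ in kept]


def build_prompt(tags):
    # Hypothetical instruction template; the real wording is not given here.
    tag_text = ", ".join(tags) if tags else "none"
    return f"Tags: {tag_text}. Describe this aerial image in one sentence."


# Made-up vocabulary and per-tag logits for illustration.
vocabulary = ["runway", "industrial zone", "river", "farmland"]
logits = [2.0, -1.0, 0.5, -3.0]

tags = predict_tags(logits, vocabulary)
prompt = build_prompt(tags)
print(prompt)  # Tags: runway, river. Describe this aerial image in one sentence.
```

Using an independent sigmoid per tag, rather than a softmax over the vocabulary, lets several tags be active for the same image, which is what a multi-label setup requires.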