AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network

Yu Hu, Jianyang Gu, Hao Liu, Yue Cao, Jozsef Hamari, Zheng Liu, Mohsen Zardadi

Abstract

Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.

Paper Structure (26 sections, 4 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: The t-SNE visualization of visual embeddings from different datasets. The same class demonstrates large variations across datasets. However, only class names are provided in the datasets, which limits multimodal alignment.
  • Figure 2: Overview of AVION. Upper-left (Training Student). Learnable prompt tokens are injected into the text and vision encoders. The student outputs embeddings: ${\mathbf t}^{S}_{k}$ (student text embedding for class $k$) and ${\mathbf v}^{S}_{i}$ (visual embedding for image $i$). Bottom-left (Offline Teacher). LLM-based Domain Prompting generates multiple class-aware descriptions, which are encoded into $\mathbf t^{T}_{k,j}$. Selective Prototype Aggregation verifies these candidates using teacher visual embeddings ${\mathbf v}^{T}_{i}$ and aggregates them into an RS-aware text prototype ${\mathbf t}^{T*}_{k}$. Right (Training Objectives). Tri-aspect alignment: (i) textual alignment $\mathcal{L}_{\text{text}}$ maximizes the cosine similarity between ${\mathbf t}^{S}_{k}$ and ${\mathbf t}^{T*}_{k}$; (ii) visual alignment $\mathcal{L}_{\text{img}}$ aligns ${\mathbf v}^{S}_{i}$ with ${\mathbf v}^{T}_{i}$; (iii) similarity logit alignment $\mathcal{L}_{\text{logit}}$ matches teacher logits $s^{T}_{i,k}=({\mathbf v}^{T}_{i})^{\top}{\mathbf t}^{T*}_{k}$ and student logits $s^{S}_{i,k}=({\mathbf v}^{S}_{i})^{\top}{\mathbf t}^{S}_{k}$ via a temperature-scaled KL divergence. A standard task loss $\mathcal{L}_{\text{task}}$ (cross-entropy) is applied on the student.
  • Figure 3: LLM-based Domain Prompting. Given a class $k$, an RS-aware query asks the LLM to produce aerial-view descriptions. RS-Flag is used to examine whether the description contains RS-related tokens.
  • Figure 4: Selective Prototype Aggregation. For each class $k$, teacher image embeddings $\{\mathbf v^{T}_{k,i}\}_{i\in\mathcal{B}_k}$ are averaged to form the visual prototype ${\widehat{\mathbf v}}^{T}_{k}$. Each LLM-generated description is encoded to a teacher text embedding $\mathbf t^{T}_{k,j}$ and scored by cosine similarity $s_{k,j}=({\widehat{\mathbf v}}^{T}_{k})^{\top}\mathbf t^{T}_{k,j}$. A median/MAD threshold $\pm\zeta_s$ removes outlier candidates. The remaining embeddings are combined with softmax-normalized weights (over the kept $j$) to obtain the class textual prototype ${\mathbf t}^{T*}_{k}$.
  • Figure 5: t-SNE of base-to-novel visual and text embeddings on RESISC-45. (a) GeoRSCLIP [zhang2024rs5m], (b) AVION (ours), (c) APPLeNet [jha2023applenet], (d) MMRL [guo2025mmrl]. Gray dots represent the visual embeddings of base classes. Colored dots represent the visual embeddings of novel classes, with each color standing for one class. Colored crosses represent the text embeddings of the novel class names. A, B, C, and D are four classes.
  • ...and 1 more figure
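
The Figure 2 and Figure 4 captions describe Selective Prototype Aggregation and the tri-aspect objectives only at a high level, so the following NumPy sketch is one possible reading and not the authors' implementation. Several details are assumptions that the captions do not state: the outlier rule is taken to be $|s_{k,j}-\mathrm{median}| \le \zeta_s \cdot \mathrm{MAD}$, the softmax weights use a hypothetical temperature `temp`, the text and image alignment terms are written as one minus cosine similarity, the KL term is taken in the teacher-to-student direction, and the task loss reuses the same temperature `tau`. The function names `aggregate_prototype` and `tri_aspect_losses` are illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(z, axis=-1):
    """Numerically stable log-softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def aggregate_prototype(v_teacher, t_candidates, zeta=2.5, temp=0.07):
    """Build one class text prototype from LLM description embeddings.

    v_teacher:    (n_img, d) teacher image embeddings of a single class
    t_candidates: (n_desc, d) teacher text embeddings of the descriptions
    zeta, temp:   assumed hyperparameters (not given in the captions)
    """
    # Visual prototype: mean of normalized image embeddings, renormalized.
    v_hat = l2_normalize(l2_normalize(v_teacher).mean(axis=0))
    t = l2_normalize(t_candidates)
    s = t @ v_hat  # cosine similarity of each description to the prototype

    # Assumed median/MAD rule: drop candidates far from the median score.
    med = np.median(s)
    mad = np.median(np.abs(s - med))
    keep = np.abs(s - med) <= zeta * mad + 1e-12

    # Softmax weights over the kept candidates only.
    w = np.exp(log_softmax(s[keep] / temp))
    return l2_normalize(w @ t[keep]), keep

def tri_aspect_losses(v_s, t_s, v_t, t_t, labels, tau=0.07):
    """Student/teacher alignment terms plus cross-entropy on the student.

    v_s, v_t: (batch, d) student / teacher image embeddings
    t_s, t_t: (classes, d) student text embeddings / teacher prototypes
    labels:   (batch,) integer class labels
    """
    v_s, t_s, v_t, t_t = (l2_normalize(a) for a in (v_s, t_s, v_t, t_t))
    # Assumed form: 1 - cosine similarity, averaged over classes / images.
    loss_text = np.mean(1.0 - np.sum(t_s * t_t, axis=1))
    loss_img = np.mean(1.0 - np.sum(v_s * v_t, axis=1))
    # Temperature-scaled KL between teacher and student similarity logits.
    logp_t = log_softmax((v_t @ t_t.T) / tau)
    logp_s = log_softmax((v_s @ t_s.T) / tau)
    loss_logit = np.mean(np.sum(np.exp(logp_t) * (logp_t - logp_s), axis=1))
    # Cross-entropy task loss on the student logits.
    loss_task = -np.mean(logp_s[np.arange(len(labels)), labels])
    return {"text": loss_text, "img": loss_img,
            "logit": loss_logit, "task": loss_task}
```

In this sketch the teacher side (`aggregate_prototype`) would be run once per class before training, which matches the caption's description of the teacher as offline. How the four loss terms are weighted against each other is not specified in the captions, so the function returns them separately.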