Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

Minh Kha Do; Wei Xiang; Kang Han; Di Wu; Khoa Phan; Yi-Ping Phoebe Chen; Gaowen Liu; Ramana Rao Kompella

TL;DR

SATtxt is a spectrum-aware VLFM that takes only RGB inputs at inference while retaining spectral cues learned during training. It improves zero-shot classification by 4.2%, retrieval by 5.9%, and linear probing by 2.7% on average over baselines.

Abstract

Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/
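The first stage described above (a frozen multi-spectral teacher, an RGB student, and a lightweight projector) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the linear projector, the mean-squared-error objective, the plain SGD update, and the synthetic feature vectors that stand in for encoder outputs. The abstract does not specify the loss or the projector architecture, so this should not be read as the authors' implementation.

```python
import random

def project(W, x):
    """Apply a linear projector W (out_dim x in_dim) to a feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def distill_step(W, rgb_feat, ms_feat, lr=0.1):
    """One SGD step pulling project(W, rgb_feat) toward the teacher feature.

    Only the projector W is updated (in place); both feature vectors are
    treated as outputs of frozen encoders. Returns the loss before the update.
    """
    pred = project(W, rgb_feat)
    n = len(ms_feat)
    for i, row in enumerate(W):
        grad_out = 2.0 * (pred[i] - ms_feat[i]) / n  # d(MSE)/d(pred[i])
        for j in range(len(row)):
            row[j] -= lr * grad_out * rgb_feat[j]
    return mse(pred, ms_feat)

def average_loss(W, pairs):
    """Average distillation loss over (student feature, teacher feature) pairs."""
    return sum(mse(project(W, x), t) for x, t in pairs) / len(pairs)

# Synthetic stand-ins (hypothetical): teacher features are a fixed linear
# function of the student features, so a linear projector can match them.
random.seed(0)
in_dim, out_dim = 4, 3
hidden_map = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
pairs = []
for _ in range(32):
    x = [random.uniform(-1, 1) for _ in range(in_dim)]
    pairs.append((x, project(hidden_map, x)))

W = [[0.0] * in_dim for _ in range(out_dim)]  # projector starts at zero
initial_loss = average_loss(W, pairs)
for _ in range(200):
    for x, t in pairs:
        distill_step(W, x, t)
final_loss = average_loss(W, pairs)
```

In this toy setting the projector recovers the hidden mapping, so the loss falls toward zero. With real encoders the teacher features would not be an exact linear function of the student features, and the residual would stay above zero.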

Paper Structure (17 sections, 8 equations, 10 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Representation Learning in Remote Sensing
Vision-Language Foundation Models for Remote Sensing
Method
Spectral Representation Distillation
Spectrally Grounded Alignment with Instruction-Augmented LLMs
Experiments
Results
Ablation Study
Conclusion
Acknowledgement
Limitations and Future Work
Additional Quantitative Results
Additional Qualitative Results
...and 2 more sections

Figures (10)

Figure 1: SATtxt outperforms existing VLFMs across three satellite benchmarks while requiring only RGB inputs. By contrast, multi-spectral VLFMs (e.g., DOFA-CLIP) exhibit inconsistent gains from multi-spectral (MS) inputs.
Figure 2: Patch-wise similarity between images and label prompts for models pre-trained on the same dataset. The prefix FT- denotes a model further pre-trained on this dataset, consistent with Llama3-MS-CLIP and SATtxt. Using an instruction-augmented LLM as the text encoder, our method produces sharper object-level focus (e.g., river) and clearer contextual relations (e.g., residential), supporting strong zero-shot predictions.
Figure 3: Pre-training strategies for VLFMs on satellite imagery. (a) CLIP-style continued pre-training (e.g., RemoteCLIP), (b) LiT: Locked-image Tuning (e.g., ), (c) LTT: Locked-text Tuning (e.g., DOFA-CLIP), and (d) SATtxt: bridges two strong, frozen encoders via instruction-augmented text from an LLM, improving training efficiency as well as zero-shot and linear-probe performance.
Figure 4: Two-stage pre-training pipeline for SATtxt. Stage 1 (Spectral Representation Distillation, dashed lines): a vision projector is trained to reconstruct multi-spectral representations from an RGB encoder by distilling a frozen MS teacher, transferring spectral knowledge so that MS inputs are unnecessary in Stage 2 and at inference. Stage 2 (Spectrally Grounded Alignment with Instruction-Augmented LLMs, solid lines): with the vision and text encoders frozen, distilled vision features are aligned with LLM-based text embeddings using instruction-augmented prompts, enhancing cross-modal representations while preserving pretrained capabilities.
Figure 5: Patch-wise image-text similarity maps. For each image, we compute cosine similarity between patch-level vision embeddings and the text embedding produced from the prompt "a satellite image of {class}". Compared to baselines, SATtxt yields sharper and more contiguous responses that trace class-consistent structures (e.g., linear rivers and highways) and reduce spurious activations in the background.
...and 5 more figures
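The patch-wise similarity maps described for Figure 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`cosine`, `similarity_map`, `build_prompt`), the row-major patch ordering, and the two-dimensional toy embeddings are all hypothetical, and the encoders that would produce real embeddings are not shown. Only the prompt template and the use of cosine similarity come from the caption.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_prompt(class_name):
    """Fill the prompt template quoted in the Figure 5 caption."""
    return f"a satellite image of {class_name}"

def similarity_map(patch_embeddings, text_embedding, grid_width):
    """Score every patch against one text embedding and reshape to a 2D grid.

    patch_embeddings is assumed to be a flat, row-major list of patch vectors.
    """
    scores = [cosine(patch, text_embedding) for patch in patch_embeddings]
    return [scores[i:i + grid_width] for i in range(0, len(scores), grid_width)]

# Toy 2x2 patch grid with 2-dimensional embeddings (illustrative values only).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
text_embedding = [1.0, 0.0]  # stands in for the embedding of build_prompt("river")
sim = similarity_map(patches, text_embedding, grid_width=2)
```

Each entry of `sim` lies in [-1, 1]. Higher values mark patches whose embedding points in the same direction as the prompt embedding, which is what a heatmap of this kind visualizes.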
