A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features

Enes Karanfil, Nevrez Imamoglu, Erkut Erdem, Aykut Erdem

TL;DR

RGB-centric approaches limit remote-sensing scene understanding on multispectral data. Spectral-LLaVA extends a vision-language framework by freezing a SpectralGPT multispectral encoder and learning a lightweight projector that aligns visual features with a decoder-only LLaMA3 model, enabling simultaneous scene description and classification. The authors introduce Spectral-Inst, a multispectral instruction-tuning dataset built on BigEarthNet-v2, and demonstrate that language-grounded features yield richer semantic representations and improved downstream performance on the EuroSAT and BigEarthNet-v2 benchmarks. The approach supports multispectral reasoning with minimal changes to the vision backbone, which benefits practical analysis for land-use, coastal monitoring, and environmental applications.

Abstract

Scene understanding in remote sensing often faces challenges in generating accurate representations for complex environments such as various land use areas or coastal regions, which may also include snow, clouds, or haze. To address this, we present a vision-language framework named Spectral LLaVA, which integrates multispectral data with vision-language alignment techniques to enhance scene representation and description. Using the BigEarthNet v2 dataset from Sentinel-2, we establish a baseline with RGB-based scene descriptions and further demonstrate substantial improvements through the incorporation of multispectral information. Our framework optimizes a lightweight linear projection layer for alignment while keeping the vision backbone of SpectralGPT frozen. Our experiments encompass scene classification using linear probing and language modeling for jointly performing scene classification and description generation. Our results highlight Spectral LLaVA's ability to produce detailed and accurate descriptions, particularly for scenarios where RGB data alone proves inadequate, while also enhancing classification performance by refining SpectralGPT features into semantically meaningful representations.
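The training setup described in the abstract (a frozen vision backbone plus a trainable linear projection) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `frozen_encoder` is a hypothetical stand-in for the real SpectralGPT backbone, the 12 values per sample and the 2-to-3 dimension mapping are arbitrary, and the squared-error objective against `toy_target` replaces the language-modeling objective the actual framework would use. The only point shown is that gradient updates touch the projector and nothing else.

```python
import random


def frozen_encoder(bands):
    """Hypothetical stand-in for a frozen backbone: a fixed function with no trainable state."""
    mean = sum(bands) / len(bands)
    spread = max(bands) - min(bands)
    return [mean, spread]


class LinearProjector:
    """The only trainable component in this sketch: y = W x + b."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(out_dim)]
        self.b = [0.0] * out_dim

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.W, self.b)]

    def sgd_step(self, x, target, lr):
        """One SGD step on squared error; returns the loss before the update."""
        err = [y - t for y, t in zip(self.forward(x), target)]
        for o, e in enumerate(err):
            for i, xi in enumerate(x):
                self.W[o][i] -= lr * 2 * e * xi
            self.b[o] -= lr * 2 * e
        return sum(e * e for e in err)


def toy_target(features):
    """Assumed target embedding (illustrative only): a fixed linear map of the features."""
    f0, f1 = features
    return [2 * f0 - f1, f0 + 0.5 * f1, -f0]


# Synthetic "multispectral" samples: 12 values each, with varying brightness.
rng = random.Random(0)
samples = []
for _ in range(20):
    scale = rng.uniform(0.1, 1.0)
    samples.append([scale * rng.random() for _ in range(12)])

projector = LinearProjector(in_dim=2, out_dim=3)
losses = []
for epoch in range(200):
    total = 0.0
    for bands in samples:
        feats = frozen_encoder(bands)  # the encoder is never updated
        total += projector.sgd_step(feats, toy_target(feats), lr=0.1)
    losses.append(total / len(samples))
```

Because the encoder has no parameters in the update loop, any change in the aligned features comes entirely from `projector.W` and `projector.b`, which mirrors the "minimal changes to the vision backbone" design choice in spirit.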

Paper Structure

This paper contains 18 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Proposed Spectral-LLaVA vision-language framework for multispectral images.
  • Figure 2: Qualitative chat samples for given multispectral images. The images shown are contrast-enhanced RGB versions of the multispectral data, used for visualization only. Each sample description consists of the first few sentences of the model output and serves as a visual example.
  • Figure 3: Comparison of visual features and aligned features projected into 2D space using t-SNE for EuroSAT multispectral data. The alignment transforms raw visual features into a domain-aligned latent space with improved clustering.
  • Figure 4: Test accuracy comparison for SpectralGPT and Spectral-LLaVA features through linear probing on the EuroSAT dataset with 5-fold cross-validation using various train-test split ratios.
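
The 5-fold cross-validation mentioned in the Figure 4 caption can be sketched as follows. This is a generic k-fold index split, not the authors' evaluation code: the function names `k_fold_indices` and `mean_accuracy` are hypothetical, and how the paper combines the folds with its varying train-test split ratios is not stated here. A linear probe would be fit on the features at each `train` index list and scored on the matching `test` list.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and return k (train, test) index pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def mean_accuracy(fold_accuracies):
    """Average the per-fold test accuracies into a single score."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Example: 23 samples split into 5 folds of nearly equal size.
splits = k_fold_indices(23, k=5)
```

Each sample lands in exactly one test fold, so every feature vector is used for evaluation once and for training in the remaining folds.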