Label-Efficient Hyperspectral Image Classification via Spectral FiLM Modulation of Low-Level Pretrained Diffusion Features

Yuzhen Hu, Biplab Banerjee, Saurabh Prasad

TL;DR

This work tackles label-efficient hyperspectral image classification by repurposing a frozen diffusion model, pretrained on natural images, as a spatial feature extractor and fusing those features with spectral information through FiLM-based modulation. GeoDiffNet uses low-level diffusion features to generalize across geospatial domains without finetuning, while GeoDiffNet-F adds spectrally conditioned FiLM to enable dynamic multimodal fusion under sparse supervision. The key findings are that early diffusion timesteps and higher decoder layers yield more transferable spatial features, and that spectral FiLM fusion outperforms baselines on the Augsburg and Berlin datasets. The approach shows strong cross-domain transferability and practical potential for remote sensing tasks with limited labeled data. The code is publicly available.

Abstract

Hyperspectral imaging (HSI) enables detailed land cover classification, yet low spatial resolution and sparse annotations pose significant challenges. We present a label-efficient framework that leverages spatial features from a frozen diffusion model pretrained on natural images. Our approach extracts low-level representations from high-resolution decoder layers at early denoising timesteps, which transfer effectively to the low-texture structure of HSI. To integrate spectral and spatial information, we introduce a lightweight FiLM-based fusion module that adaptively modulates frozen spatial features using spectral cues, enabling robust multimodal learning under sparse supervision. Experiments on two recent hyperspectral datasets demonstrate that our method outperforms state-of-the-art approaches using only the provided sparse training labels. Ablation studies further highlight the benefits of diffusion-derived features and spectral-aware fusion. Overall, our results indicate that pretrained diffusion models can support domain-agnostic, label-efficient representation learning for remote sensing and broader scientific imaging tasks.
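The FiLM-based fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the random weights, and the names `init_mlp`, `mlp`, and `film_fuse` are assumptions made for illustration, the spectral encoder and the regression MLP are collapsed into one small network, and random arrays stand in for the frozen diffusion features. The sketch only shows the standard FiLM operation, in which per-pixel spectra produce a scale $\gamma$ and a shift $\beta$ that modulate the spatial features channel by channel.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a small MLP; sizes = [in, hidden, ..., out] (illustrative)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, params):
    """Apply affine layers with ReLU between them (no activation on the output)."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def film_fuse(spatial, spectra, params):
    """FiLM: modulate spatial features (N, C) using spectra (N, B).

    The MLP maps each spectrum to 2*C values, split into gamma and beta.
    """
    c = spatial.shape[1]
    gamma_beta = mlp(spectra, params)            # (N, 2C)
    gamma, beta = gamma_beta[:, :c], gamma_beta[:, c:]
    return gamma * spatial + beta                # channel-wise scale and shift

rng = np.random.default_rng(0)
n_pixels, n_channels, n_bands = 5, 8, 20         # hypothetical sizes
spatial = rng.normal(size=(n_pixels, n_channels))   # stand-in for frozen diffusion features
spectra = rng.uniform(size=(n_pixels, n_bands))     # stand-in for per-pixel reflectance
params = init_mlp(rng, [n_bands, 16, 2 * n_channels])
fused = film_fuse(spatial, spectra, params)      # (n_pixels, n_channels)
```

Because the spatial features stay frozen, only the small network that produces $\gamma$ and $\beta$ (plus a classifier on the fused features) would need training in a setup like this, which is one reason such a fusion is plausible under sparse labels.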

Paper Structure

This paper contains 28 sections, 3 equations, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: Workflow of GeoDiffNet and GeoDiffNet-F. GeoDiffNet extracts low-level spatial features from RGB-like patches using a frozen pretrained diffusion model. A lightweight MLP is applied to each pixel for classification. GeoDiffNet-F further incorporates spectral context by encoding per-pixel reflectance signals into spectral embeddings, which are used to regress scaling ($\gamma$) and shifting ($\beta$) vectors through an MLP. These vectors condition the spatial features via a FiLM layer, enabling adaptive cross-modal fusion for land-cover classification.
  • Figure 2: On both datasets, performance peaks at higher decoder layers, which capture low-level features. (a) Augsburg: performance peaks at layer 10 (timestep 0). (b) Berlin: performance peaks at layer 11 (timestep 50).
  • Figure 3: Visualization on Berlin HSI: (a) RGB image, (b) training label map, (c) ground truth (test labels), (d) GeoDiffNet output, and (e) GeoDiffNet-F output.
  • Figure 4: Visualization on Augsburg HSI: (a) pseudo-RGB image, (b) training label map, (c) ground truth, (d) GeoDiffNet output, and (e) GeoDiffNet-F output.
  • Figure 5: Feature Clustering across decoder layers and timesteps. K-means clustering ($k{=}6$) is applied to decoder features from layers 6–11 across timesteps $T{=}0$ to $T{=}200$. The input is a $64{\times}64$ pseudo-RGB patch sampled from the Berlin hyperspectral dataset. Left: the original pseudo-RGB patch and a high-resolution reference image from Google Earth (circa 2009) are shown for context. Note that cluster colors are assigned independently in each plot and are therefore not consistent across layers or timesteps.
  • ...and 3 more figures
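
The feature clustering described in the Figure 5 caption can be sketched as follows. This is a hedged illustration, not the authors' code: a random array stands in for the decoder features of a $64{\times}64$ patch, the channel count of 16 is an assumption, and `kmeans` is a plain Lloyd's-algorithm helper written for this example. Each pixel's feature vector is treated as one sample, clustered with $k{=}6$, and the labels are reshaped back into a map.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means on rows of x; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct samples (fancy indexing copies the rows).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every center: (N, k).
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:          # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)
height, width, channels = 64, 64, 16      # channel count is hypothetical
feature_map = rng.normal(size=(height, width, channels))  # stand-in for decoder features

# One sample per pixel, then reshape the cluster labels back into an image.
labels, centers = kmeans(feature_map.reshape(-1, channels), k=6)
label_map = labels.reshape(height, width)
```

Cluster indices from k-means are arbitrary, so running this separately per layer or timestep gives label maps whose colors do not correspond across plots, consistent with the note in the caption.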