Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

Filip Wolf, Blaž Rolih, Luka Čehovin Zajc

TL;DR

This work proposes a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student's pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs).

Abstract

Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student's pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources. Project page: https://wolfilip.github.io/DEO/

Paper Structure (25 sections, 7 equations, 8 figures, 14 tables)

Figures (8)

  • Figure 1: DEO, our proposed dual-teacher pretraining approach, yields a model that achieves state-of-the-art results on multispectral EO tasks while maintaining performance on optical EO tasks. The top panel compares performance in optical and multispectral semantic segmentation, with colored circles indicating model size. Below, the first row of images shows qualitative results on the optical SpaceNetv1 dataset [van2018spacenet], and the second row shows results on the multispectral m-SA-crop-type dataset [lacoste2023geo].
  • Figure 2: Overview of the double-distillation pretraining approach. Pretraining uses a standard FMoW Sentinel-2 dataset, augmented with high-resolution aerial images where possible. Random crops and other augmentations are applied to the sampled images. The full multispectral input and its optical channel subset are fed to their corresponding distillation branches. The multispectral branch is a contrastive learning setup in which the teacher is updated using an exponential moving average (EMA) of the student. In the optical branch, distillation uses a frozen VFM teacher. The resulting model can then be used in various downstream tasks.
  • Figure 3: PCA feature visualization and comparison between Copernicus-FM [wang2025towards], DINOv3-LS [simeoni2025dinov3], and DEO (ours). We note the similarity of our method's features to those of DINOv3.
  • Figure 4: Qualitative results for semantic segmentation. The first two columns show the optical part of the input image and the ground truth. The final three columns show predictions from two related models and from our own.
  • Figure 5: Extended qualitative results for Sen1Floods11.
  • ...and 3 more figures
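
The pretraining setup summarized in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the loss forms (a DINO-style cross-entropy for the EMA branch, a cosine distance for the frozen-teacher branch), the temperatures, the momentum value, the weighting factor `optical_weight`, and the band indices in `RGB_IDX` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

# Hypothetical positions of the red, green, blue bands in a 13-band
# Sentinel-2 stack (assumption; the real ordering depends on the dataset).
RGB_IDX = [3, 2, 1]


def select_channels(batch, idx):
    """Pick a channel subset from a (batch, channels, height, width) array."""
    return batch[:, idx, :, :]


def ema_update(teacher, student, momentum=0.996):
    """Move each teacher parameter toward the student's via an exponential moving average."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k] for k in teacher}


def softmax(x, temp):
    """Temperature-scaled, numerically stable softmax over the last axis."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def self_distillation_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student distributions (assumed DINO-style form)."""
    p_teacher = softmax(teacher_logits, t_teacher)
    log_p_student = np.log(softmax(student_logits, t_student) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())


def cosine_distillation_loss(student_feats, frozen_feats):
    """Mean cosine distance between student features and frozen-teacher features (assumed form)."""
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    f = frozen_feats / np.linalg.norm(frozen_feats, axis=-1, keepdims=True)
    return float((1.0 - (s * f).sum(axis=-1)).mean())


def dual_teacher_loss(ms_student, ms_teacher, opt_student, opt_frozen, optical_weight=1.0):
    """Sum of the multispectral (EMA-teacher) and optical (frozen-teacher) branch losses."""
    ms_loss = self_distillation_loss(ms_student, ms_teacher)
    opt_loss = cosine_distillation_loss(opt_student, opt_frozen)
    return ms_loss + optical_weight * opt_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(2, 13, 8, 8))      # toy multispectral batch
    optical = select_channels(images, RGB_IDX)   # input to the optical branch
    loss = dual_teacher_loss(
        rng.normal(size=(2, 16)), rng.normal(size=(2, 16)),
        rng.normal(size=(2, 32)), rng.normal(size=(2, 32)),
    )
    print(optical.shape, loss)
```

In this sketch the frozen teacher never receives an update, while the multispectral teacher would be refreshed with `ema_update` after each student optimization step.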