GeoCrossBench: Cross-Band Generalization for Remote Sensing
Hakob Tamazyan, Ani Vanyan, Alvard Barseghyan, Anna Khosrovyan, Evan Shelhamer, Hrant Khachatrian
TL;DR
GeoCrossBench targets cross-band generalization in remote sensing (RS) by extending GeoBench with three evaluation settings: in-distribution bands, bands with no overlap with the training set (No-Overlap), and a superset of the training bands (Superset). The settings cover scene classification, semantic segmentation, and change detection. The work also introduces ChiViT, a self-supervised ChannelViT baseline trained on a large multi-modal RS pretraining corpus to improve cross-band transfer. Experiments show that state-of-the-art RS foundation models do not consistently outperform general-purpose vision models in-distribution, and that all models degrade substantially when generalizing to new bands in both the No-Overlap and Superset settings. Fine-tuning only the last linear layer with oracle labels yields relatively consistent cross-satellite performance, which suggests the benchmark is not yet saturated. The work highlights the need for robust cross-band generalization strategies and provides public code and data to accelerate progress.
Abstract
The number and diversity of remote sensing satellites grow over time, while the vast majority of labeled data comes from older satellites. As foundation models for Earth observation scale up, the cost of (re-)training them to support new satellites grows too, so their ability to generalize to new satellites becomes increasingly important. In this work we introduce GeoCrossBench, an extension of the popular GeoBench benchmark with a new evaluation protocol that tests in-distribution performance, generalization to satellites with no band overlap, and generalization to satellites with additional bands relative to the training set. We also develop ChiViT, a self-supervised extension of ChannelViT, to improve its cross-satellite performance. First, we show that even the best foundation models for remote sensing (DOFA, TerraFM) do not outperform general-purpose models like DINOv3 in the in-distribution setting. Second, when generalizing to new satellites with no band overlap, all models suffer a 2-4x drop in performance, and ChiViT significantly outperforms the runner-up, DINOv3. Third, the performance of all tested models drops on average by 5-25% when they are given additional bands at test time. Finally, we show that fine-tuning just the last linear layer of these models using oracle labels from all bands yields relatively consistent performance across all satellites, highlighting that the benchmark is far from saturated. We publicly release the code and the datasets to encourage the development of more future-proof remote sensing models with stronger cross-satellite generalization.
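As an illustration of the three evaluation settings described above, the relationship between the training and test band sets can be expressed with simple set operations. This is a minimal sketch, not the benchmark's actual code: the function name `band_setting`, the "other" fallback for partial overlap, and the example band identifiers (Sentinel-2 style `B02`/`B03`/`B04`/`B08`, Sentinel-1 style `VV`/`VH`) are assumptions chosen for illustration only.

```python
def band_setting(train_bands, test_bands):
    """Label a test band set relative to the training band set.

    Hypothetical helper: the setting names follow the abstract, but this
    function and the "other" fallback are illustrative assumptions.
    """
    train, test = set(train_bands), set(test_bands)
    if test == train:
        return "in-distribution"   # same bands as seen during training
    if not (train & test):
        return "no-overlap"        # no band is shared with training
    if train < test:
        return "superset"          # all training bands plus additional ones
    return "other"                 # partial overlap, not covered by the abstract


# Example band lists (assumed for illustration, not taken from the benchmark).
rgb = ["B04", "B03", "B02"]
print(band_setting(rgb, ["B02", "B03", "B04"]))  # in-distribution
print(band_setting(rgb, ["VV", "VH"]))           # no-overlap
print(band_setting(rgb, rgb + ["B08"]))          # superset
```

Band order is ignored here because the comparison is on sets; a model that is sensitive to channel order would need that handled separately.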
