Representation Learning for Remote Sensing: An Unsupervised Sensor Fusion Approach

Aidan M. Swope; Xander H. Rudelis; Kyle T. Story

TL;DR

This work tackles label scarcity in remote sensing by introducing Contrastive Sensor Fusion (CSF), a self-supervised objective that learns fused representations across multiple sensors. CSF generates two views of each scene from random channel subsets, encodes them with a shared Siamese network, and optimizes a multi-layer InfoNCE loss to align high-level scene representations. Trained on a dataset of 47 million unlabeled image triplets, CSF yields semantically meaningful features that outperform ImageNet pretraining on downstream OpenStreetMap-based tasks, with improvements accumulating as more sensors are fused. The result is a single encoder that produces useful representations from any combination of its input channels, which makes large-scale remote sensing analysis practical without labeled data.

Abstract

In the application of machine learning to remote sensing, labeled data is often scarce or expensive, which impedes the training of powerful models like deep convolutional neural networks. Although unlabeled data is abundant, recent self-supervised learning approaches are ill-suited to the remote sensing domain. In addition, most remote sensing applications currently use only a small subset of the multi-sensor, multi-channel information available, motivating the need for fused multi-sensor representations. We propose a new self-supervised training objective, Contrastive Sensor Fusion, which exploits coterminous data from multiple sources to learn useful representations of every possible combination of those sources. This method uses information common across multiple sensors and bands by training a single model to produce a representation that remains similar when any subset of its input channels is used. Using a dataset of 47 million unlabeled coterminous image triplets, we train an encoder to produce semantically meaningful representations from any possible combination of channels from the input sensors. These representations outperform fully supervised ImageNet weights on a remote sensing classification task and improve as more sensors are fused. Our code is available at https://storage.cloud.google.com/public-published-datasets/csf_code.zip.
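The objective described above (two views built from random channel subsets, a shared encoder, and a contrastive loss that pulls the two views of the same scene together) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 12-channel input, the zero-masking of dropped channels, the pooling-plus-linear stand-in encoder, and the single-layer InfoNCE loss with a temperature of 0.1 are all assumptions made for illustration.

```python
import numpy as np


def random_channel_mask(rng, n_channels, keep_prob=0.5):
    """Return a 0/1 mask selecting a random, non-empty subset of channels."""
    mask = rng.random(n_channels) < keep_prob
    if not mask.any():
        # Guarantee at least one channel survives (illustrative choice).
        mask[rng.integers(n_channels)] = True
    return mask.astype(np.float32)


def toy_encoder(x, weights):
    """Hypothetical stand-in for the shared encoder.

    Averages each channel over the spatial axes, applies a linear map,
    and L2-normalizes the result. x has shape (batch, height, width, channels).
    """
    pooled = x.mean(axis=(1, 2))                      # (batch, channels)
    z = pooled @ weights                              # (batch, dim)
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)


def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss: row i of z1 should match row i of z2, not other rows."""
    logits = z1 @ z2.T / temperature                  # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


rng = np.random.default_rng(0)
n_channels = 12                                       # assumed channel count
batch = rng.normal(size=(8, 16, 16, n_channels)).astype(np.float32)
weights = rng.normal(size=(n_channels, 32)).astype(np.float32)

# Two views of the same scenes, each seeing a different channel subset.
view_a = batch * random_channel_mask(rng, n_channels)
view_b = batch * random_channel_mask(rng, n_channels)

# The same weights encode both views (Siamese setup).
loss = info_nce(toy_encoder(view_a, weights), toy_encoder(view_b, weights))
```

In a real training loop the loss would be minimized with respect to the encoder weights, so that the representation of a scene stays similar whichever channels are visible. The sketch only evaluates the loss once to show how the pieces fit together.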
