Representation Learning for Remote Sensing: An Unsupervised Sensor Fusion Approach

Aidan M. Swope, Xander H. Rudelis, Kyle T. Story

TL;DR

This work tackles label scarcity in remote sensing by introducing Contrastive Sensor Fusion (CSF), a self-supervised objective that learns fused representations across multiple sensors. CSF generates two views from random channel subsets, encodes them with a shared Siamese network, and optimizes a multi-layer InfoNCE loss to align high-level scene representations. On a 47 million-triplet unlabeled dataset, CSF yields semantically meaningful features that outperform ImageNet pretraining on downstream OpenStreetMap-based tasks, with improvements accumulating as more sensors are fused. The approach promises robust multi-sensor representations that generalize across modalities and holds strong practical implications for scalable remote sensing analysis without labeled data.

Abstract

In the application of machine learning to remote sensing, labeled data is often scarce or expensive, which impedes the training of powerful models like deep convolutional neural networks. Although unlabeled data is abundant, recent self-supervised learning approaches are ill-suited to the remote sensing domain. In addition, most remote sensing applications currently use only a small subset of the multi-sensor, multi-channel information available, motivating the need for fused multi-sensor representations. We propose a new self-supervised training objective, Contrastive Sensor Fusion, which exploits coterminous data from multiple sources to learn useful representations of every possible combination of those sources. This method uses information common across multiple sensors and bands by training a single model to produce a representation that remains similar when any subset of its input channels is used. Using a dataset of 47 million unlabeled coterminous image triplets, we train an encoder to produce semantically meaningful representations from any possible combination of channels from the input sensors. These representations outperform fully supervised ImageNet weights on a remote sensing classification task and improve as more sensors are fused. Our code is available at https://storage.cloud.google.com/public-published-datasets/csf_code.zip.
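
The objective described above (two views built from random channel subsets, a shared encoder, and a contrastive loss that pulls views of the same scene together) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The 12-channel input, the zeroing of dropped channels, the toy linear encoder, the single-layer loss (the paper uses a multi-layer InfoNCE loss), the temperature, and every function and parameter name below are assumptions made for illustration.

```python
import numpy as np

def random_channel_view(batch, rng, keep_prob=0.5):
    """Zero a random subset of channels per example, keeping at least one.

    Hypothetical helper: `keep_prob` and zero-filling are assumptions.
    """
    n, c = batch.shape[0], batch.shape[-1]
    mask = rng.random((n, c)) < keep_prob
    # Force one randomly chosen channel on so no example is entirely blank.
    forced = rng.integers(0, c, size=n)
    mask[np.arange(n), forced] = True
    return batch * mask[:, None, None, :]

def toy_encoder(batch, weights):
    """Stand-in for the shared encoder: global average pool, then a linear map."""
    pooled = batch.mean(axis=(1, 2))           # (n, channels)
    return pooled @ weights                    # (n, dim)

def info_nce(z1, z2, temperature=0.1):
    """Single-layer InfoNCE: row i of z1 should match row i of z2, not other rows."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature         # (n, n) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16, 16, 12))      # 8 scenes, 12 stacked channels
weights = rng.normal(size=(12, 32))            # shared by both views
view_a = random_channel_view(images, rng)
view_b = random_channel_view(images, rng)
loss = info_nce(toy_encoder(view_a, weights), toy_encoder(view_b, weights))
```

Because both views pass through the same weights, minimizing this loss rewards features that survive channel dropout, which is the property the abstract describes as a representation that remains similar when any subset of input channels is used.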

Paper Structure

This paper contains 15 sections, 4 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Learned representations of out-of-sample image scenes, visualized with PCA followed by t-SNE and colored by OpenStreetMap category. Without any labels, Contrastive Sensor Fusion has learned a representation that groups remote sensing images into semantically meaningful categories.
  • Figure 2: Coterminous remote sensing imagery from three different sensors: Airbus SPOT, NAIP (visualized here in near-infrared, red, and green), and Airbus Pléiades (see the training dataset appendix for details). As seen here, images contain many small components (roads, buildings, structures, trees) and adjacent locations can look completely different (e.g., the transition from buildings to grass). We leverage these multiple views to generate representations with any subset of available sensors or channels. Image attribution (left and right): © AIRBUS DS 2019.
  • Figure 3: Contrastive Sensor Fusion architecture during training. Weights are shared across encoder copies. The contrastive loss trains the encoder to represent the same location similarly and different locations differently, regardless of sensor/channel combination. The process to create views is explained in more detail in the view-creation figure, and the computation of the loss is detailed in the loss appendix. Image attribution: © AIRBUS DS 2019.
  • Figure 4: We compare the clustering of features based on OSM class using a nearest neighbor metric. The plots show the fraction of same-class neighbors for each point ($k=10$) as input channels are added (left), and the fraction of same-class neighbors as a function of $k$ (right). One, two, and three-channel experiments always use a single sensor, taking the red band only, the red and green bands, and the RGB bands respectively. Our features outperform ImageNet's in this unsupervised clustering metric and improve when multiple sensors are fused.
  • Figure 5: For each of the first three principal components of the 12-channel CSF representation space, we show 10 images from each single sensor (with inputs for the other two sensors zeroed) that maximally activate these directions. These principal components of representation space contain concepts (fields, bridges, and bare ground / concrete) that are stable across sensor combinations. Image attribution (SPOT, PHR): © AIRBUS DS 2019.
  • ...and 1 more figure
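
The nearest-neighbor clustering metric from the Figure 4 caption (the fraction of each point's k nearest neighbors that share its OSM class) can be sketched as below. This is a rough illustration rather than the paper's evaluation code. The Euclidean distance, the brute-force search, the synthetic two-cluster data, and the function name `same_class_neighbor_fraction` are all assumptions.

```python
import numpy as np

def same_class_neighbor_fraction(features, labels, k=10):
    """Mean fraction of each point's k nearest neighbors sharing its label."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances (assumed metric).
    sq_norms = (features ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    matches = labels[neighbors] == labels[:, None]
    return matches.mean()

# Synthetic check: two well-separated clusters, scored with true and shuffled labels.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = np.repeat([0, 1], 50)
clustered = centers[labels] + rng.normal(scale=0.5, size=(100, 2))
shuffled = rng.permutation(labels)
score_clustered = same_class_neighbor_fraction(clustered, labels, k=10)
score_shuffled = same_class_neighbor_fraction(clustered, shuffled, k=10)
```

A higher score means that points of the same class sit closer together in feature space. Sweeping `k` gives the kind of curve shown in the right-hand panel of Figure 4.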