Context-self contrastive pretraining for crop type semantic segmentation

Michail Tarasiou, Riza Alp Guler, Stefanos Zafeiriou

TL;DR

The paper tackles boundary misclassification in pixel-level crop-type segmentation from Satellite Image Time Series (SITS) by introducing the Context-Self Contrastive Loss (CSCL), a fully supervised contrastive pre-training scheme that encourages semantically consistent embeddings between each location and its local neighborhood. CSCL computes a local affinity within a dilated window, augmented with relative positional encodings, and optimizes a cosine-based contrastive loss against targets derived from the dense ground-truth labels; the pre-training requires no extra data and improves boundary delineation in dense segmentation. Empirically, CSCL achieves state-of-the-art results on crop-type datasets from France and Germany. The authors also release what is, to their knowledge, the largest publicly available SITS crop-segmentation dataset, and use ×4 super-resolution ground truth to map crops at a finer granularity. The results show strong gains at parcel boundaries, supported by ablations, with practical benefits for high-granularity crop monitoring and policy-support applications and potential for use in other dense-prediction tasks.

Abstract

In this paper, we propose a fully supervised pre-training scheme based on contrastive learning particularly tailored to dense classification tasks. The proposed Context-Self Contrastive Loss (CSCL) learns an embedding space that makes semantic boundaries pop up by use of a similarity metric between every location in a training sample and its local context. For crop type semantic segmentation from Satellite Image Time Series (SITS), we find performance at parcel boundaries to be a critical bottleneck and explain how CSCL tackles the underlying cause of that problem, improving the state-of-the-art performance in this task. Additionally, using images from the Sentinel-2 (S2) satellite missions, we compile the largest, to our knowledge, SITS dataset densely annotated by crop type and parcel identities, which we make publicly available together with the data generation pipeline. Using that data, we find that CSCL, even with minimal pre-training, improves all respective baselines, and we present a process for semantic segmentation at super-resolution for obtaining crop classes at a more granular level. The code and instructions to download the data can be found at https://github.com/michaeltrs/DeepSatModels.
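
The similarity between every location and its local context can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it assumes that the window is a square of side w_d sampled with dilation rate w_r, that the similarity is a plain cosine similarity, and that neighbors falling outside the feature map are marked with None. The relative positional encodings mentioned in the TL;DR are left out, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def local_similarity(features, w_d=3, w_r=1):
    """Compare each embedding with its dilated neighborhood.

    features: H x W nested list of embedding vectors.
    w_d: window side length (assumed odd); w_r: dilation rate (both assumptions).
    Returns an H x W x w_d x w_d nested list; entries are None where the
    neighbor lies outside the map (an illustrative padding choice).
    """
    height, width = len(features), len(features[0])
    half = w_d // 2
    offsets = [k * w_r for k in range(-half, half + 1)]
    sims = []
    for i in range(height):
        sims_row = []
        for j in range(width):
            window = []
            for di in offsets:
                window_row = []
                for dj in offsets:
                    ni, nj = i + di, j + dj
                    inside = 0 <= ni < height and 0 <= nj < width
                    window_row.append(
                        cosine(features[i][j], features[ni][nj]) if inside else None
                    )
                window.append(window_row)
            sims_row.append(window)
        sims.append(sims_row)
    return sims


# Toy 2 x 2 feature map with 2-dimensional embeddings.
features = [[[1.0, 0.0], [1.0, 0.0]],
            [[0.0, 1.0], [1.0, 1.0]]]
sims = local_similarity(features, w_d=3, w_r=1)
```

In this toy map, the top-left location is identical to its right-hand neighbor and orthogonal to the one below it, so a loss built on these values would pull the first pair together only if the labels say both belong to the same class.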

Paper Structure

This paper contains 16 sections, 11 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Pixel size is coarse enough for the signal at boundaries to mix. Here the 10m resolution Sentinel-2 grid is overlaid on a high-resolution image from https://www.google.com/maps/place/46%C2%B026'24.5%22N+5%C2%B000'57.8%22E/@46.439588,5.0160048,714m/data=!3m1!1e3!4m5!3m4!1s0x0:0x0!8m2!3d46.4401361!4d5.0160528. The magnified regions show grid locations that contain signal from multiple crop types and objects. The proposed Context-Self Contrastive Loss directly compares computed embeddings in local neighbourhoods. As such, interior regions naturally have a small contribution to the overall loss while at the same time the network learns to disambiguate boundary pixels by comparing them to nearby locations.
  • Figure 2: In satellite images interior regions of agricultural fields consist of large homogeneous regions with little signal variation. (top-left) Spectral band B08 for a location in France, (bottom-left) ground truth crop type maps for the same location. (right) Histograms of intensities over whole parcel regions (blue) and interior points (orange). Each histogram presents the intensities for a specific field as indicated by the respective mask (red outline). Interior, interior+boundary and exterior locations are shown in white, grey and black colors respectively. We observe that most of the variability in intensities originates from the boundary regions. The distribution of intensities for other spectral bands is similar to that of band B08.
  • Figure 3: Proposed pre-training scheme using the Context-Self Contrastive Loss (CSCL). For a CNN feature map $\boldsymbol{\mathcal{Y}}$ we define a similarity metric $\mathbf{S}_{ij}$ between features extracted at all locations $(i, j)$ and every location in their $w_d$, $w_r$-dilated neighborhood (top branch). Similarly, we derive training labels $\mathbf{L}^s_{ij}$ from dense annotations $\boldsymbol{\mathcal{L}}$ for fully supervised training (bottom branch).
  • Figure 4: CSCL ground truth generator example on a 2D crop type label map. To generate labels we compare the class at the center location with all locations in the local neighbourhood defined by parameters $w_d, w_r$. We use windows with parameters (a, b) $w_d=3, w_r=1$, (c) $w_d=3, w_r=3$, (d) $w_d=5, w_r=1$, (e) $w_d=3, w_r=2$, (f) $w_d=5, w_r=2$.
  • Figure 5: Mean prediction accuracy for positive (left) and negative (right) pairs w.r.t. location in the sliding window ($w_d=5$). The prediction accuracy drops for positive pairs and increases for negative pairs with increasing distance from the center.
  • ...and 12 more figures
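
The ground-truth generation described in the Figure 4 caption (compare the class at the center location with every location in its local neighbourhood) can be sketched as follows. This is a hedged illustration, not the released pipeline: it assumes $w_d$ is the window side length and $w_r$ the dilation rate, and the function name `cscl_labels` and the `ignore` value for out-of-map neighbors are hypothetical choices made here.

```python
def cscl_labels(label_map, w_d=3, w_r=1, ignore=-1):
    """Build same-class / different-class targets for every location.

    label_map: H x W nested list of integer class ids.
    w_d: window side length (assumed odd); w_r: dilation rate (both assumptions).
    Returns an H x W x w_d x w_d nested list with 1 where the neighbor has
    the same class as the center, 0 where it differs, and `ignore` where the
    neighbor lies outside the map (an illustrative choice).
    """
    height, width = len(label_map), len(label_map[0])
    half = w_d // 2
    offsets = [k * w_r for k in range(-half, half + 1)]
    targets = []
    for i in range(height):
        targets_row = []
        for j in range(width):
            window = []
            for di in offsets:
                window_row = []
                for dj in offsets:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < height and 0 <= nj < width:
                        same = label_map[ni][nj] == label_map[i][j]
                        window_row.append(1 if same else 0)
                    else:
                        window_row.append(ignore)
                window.append(window_row)
            targets_row.append(window)
        targets.append(targets_row)
    return targets


# Toy 3 x 3 crop-type map with three classes.
label_map = [[0, 0, 1],
             [0, 0, 1],
             [2, 2, 1]]
targets = cscl_labels(label_map, w_d=3, w_r=1)
print(targets[1][1])  # [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
```

The center location sits next to two parcel boundaries, so its window mixes positive and negative pairs. A location deep inside a parcel would produce an all-ones window, which is consistent with the observation in Figure 1 that interior regions contribute little to the loss.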