Contrastive pretraining for semantic segmentation is robust to noisy positive pairs

Sebastian Gerard, Josephine Sullivan

TL;DR

Domain-specific variants of contrastive learning can construct positive pairs from two distinct in-domain images. Downstream semantic segmentation turns out to be either robust to badly matched pairs or to even benefit from them, so practitioners do not have to filter their positive pairs beforehand.

Abstract

Domain-specific variants of contrastive learning can construct positive pairs from two distinct in-domain images, while traditional methods just augment the same image twice. For example, we can form a positive pair from two satellite images showing the same location at different times. Ideally, this teaches the model to ignore changes caused by seasons, weather conditions or image acquisition artifacts. However, unlike in traditional contrastive methods, this can result in undesired positive pairs, since we form them without human supervision. For example, a positive pair might consist of one image before a disaster and one after. This could teach the model to ignore the differences between intact and damaged buildings, which might be what we want to detect in the downstream task. Similar to false negative pairs, this could impede model performance. Crucially, in this setting only parts of the images differ in relevant ways, while other parts remain similar. Surprisingly, we find that downstream semantic segmentation is either robust to such badly matched pairs or even benefits from them. The experiments are conducted on the remote sensing dataset xBD, and a synthetic segmentation dataset for which we have full control over the pairing conditions. As a result, practitioners can use these domain-specific contrastive methods without having to filter their positive pairs beforehand, or might even be encouraged to purposefully include such pairs in their pretraining dataset.
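
The role a positive pair plays in the loss can be made concrete with a minimal sketch of an InfoNCE-style contrastive loss, the loss family used by MoCo. This is not the authors' implementation: the function names, the plain-list embeddings, and the temperature value are assumptions chosen for illustration. The only difference between the traditional and the domain-specific setting is where `positive` comes from: a second augmentation of the same image, or the embedding of a distinct in-domain image such as the same location at another time.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(query, positive, negatives, temperature=0.2):
    """InfoNCE-style loss for a single query embedding (illustrative sketch).

    The loss is low when the query is more similar to its positive than to
    any negative. The temperature value is an assumption, not taken from
    the paper.
    """
    q = l2_normalize(query)
    # The positive key comes first, followed by all negative keys.
    keys = [l2_normalize(positive)] + [l2_normalize(n) for n in negatives]
    logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
    # Numerically stable log-sum-exp over all logits.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    # Negative log-softmax probability assigned to the positive key.
    return log_denominator - logits[0]
```

A badly matched positive pair corresponds to a `positive` whose embedding is far from `query`; minimizing the loss then pulls the two representations together regardless, which is the effect the abstract is concerned about.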

Paper Structure (25 sections, 3 equations, 9 figures)

Figures (9)

  • Figure 1: False vs. noisy positives in contrastive learning. We mislabel CIFAR-10 images in contrastive pretraining, creating false positive pairs. These harm downstream classification. On the remote sensing dataset xBD, we pair images that only partly differ (e.g. buildings undamaged vs. damaged). This creates noisy pairs. Surprisingly, they do not harm downstream segmentation.
  • Figure 2: Contrastive learning learns by comparing feature representations $f(\cdot)$. Pairs of representations defined as similar (positive pairs) are pushed closer together by the contrastive loss. Those defined as dissimilar (negative pairs) are pushed apart. For clarity, this illustration omits the details regarding the projection head, momentum encoder and queue of negatives that are used in MoCo. After contrastive pretraining, we freeze the encoder and train a segmentation head on top of it in a supervised way. This allows us to evaluate the quality of the learned representations. In an application scenario, we would finetune the whole network, including the encoder, instead.
  • Figure 3: Generation process of the VTS dataset: The $256{\times}256$ image is segmented into 20 random Voronoi cells, which are randomly and evenly distributed between two classes. The cells are filled with texture images representing the respective classes. A noisy version of the image is created by filling a fraction $r_{img}$ of the cells, chosen uniformly at random, with a noise texture. Noisy positive pairs are created by pairing the noise-free image, consisting of only two textures, with the noisy image, consisting of three textures. The noise texture can be seen as analogous to the damaged buildings in the xBD dataset.
  • Figure 4: VTS: MoCo benefits from the inclusion of noisy positive pairs in contrastive pretraining. More noisy pairs lead to greater improvements, except for per-image noise $r_{img}{=}1.0$, in which completely different images are paired with each other.
  • Figure 5: VTS: Dense losses benefit from the inclusion of noisy positive pairs in contrastive pretraining. More noisy pairs usually do not lead to greater improvements. Per-image noise of 1.0 can lead to a decline in quality, which makes intuitive sense, since it represents pairing two completely different images with each other. Points belonging to Per-image noise: 0.25 are enlarged for better visibility.
  • ...and 4 more figures
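
The generation process described in the Figure 3 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: it produces integer label maps instead of textured images (labels 0 and 1 stand in for the two class textures, and the hypothetical constant `NOISE_LABEL` stands in for the noise texture), it reads "evenly distributed" as a balanced split of cells between the two classes, and the function name, its parameters, and the seed handling are invented for this example.

```python
import random

NOISE_LABEL = 2  # hypothetical id standing in for the noise texture


def make_vts_pair(size=256, n_cells=20, r_img=0.25, seed=0):
    """Return (clean, noisy) label maps forming one noisy positive pair."""
    rng = random.Random(seed)
    # Random Voronoi seed points inside the image.
    points = [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n_cells)]
    # Assumption: cells are split evenly between the two classes, in random order.
    cell_class = [i % 2 for i in range(n_cells)]
    rng.shuffle(cell_class)
    # Choose a fraction r_img of the cells uniformly at random to become noise.
    n_noisy = int(round(r_img * n_cells))
    noisy_cells = set(rng.sample(range(n_cells), n_noisy))

    clean, noisy = [], []
    for y in range(size):
        clean_row, noisy_row = [], []
        for x in range(size):
            # A pixel belongs to the Voronoi cell of its nearest seed point.
            cell = min(
                range(n_cells),
                key=lambda i: (points[i][0] - x) ** 2 + (points[i][1] - y) ** 2,
            )
            clean_row.append(cell_class[cell])
            noisy_row.append(NOISE_LABEL if cell in noisy_cells else cell_class[cell])
        clean.append(clean_row)
        noisy.append(noisy_row)
    return clean, noisy
```

With `r_img=0.0` the two maps are identical, and with `r_img=1.0` the noisy map consists of noise only, which corresponds to the case the captions describe as pairing two completely different images.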