A Contrastive Learning Scheme with Transformer Innate Patches

Sander Riisøen Jyhne, Per-Arne Andersen, Morten Goodwin

TL;DR

This work tackles accurate aerial image segmentation under class imbalance and fine-grained class boundaries by leveraging the intrinsic patch structure of vision transformers. It proposes Contrastive Transformer (CT), a patch-based, end-to-end contrastive learning scheme that performs intra- and inter-image sampling guided by ground-truth masks and applies losses at multiple encoder stages. Together with a standard segmentation objective, CT yields consistent mean IoU improvements across three backbones (Swin, UnetFormer, PoolFormer) on the ISPRS Potsdam dataset using InfoNCE or a cosine-based contrastive loss. The approach is memory-efficient, avoids large batch requirements, and generalizes across architectures, offering a practical route to enhance dense prediction in aerial and potentially other domains.

Abstract

This paper presents Contrastive Transformer, a contrastive learning scheme using the Transformer innate patches. Contrastive Transformer enables existing contrastive learning techniques, often used for image classification, to benefit dense downstream prediction tasks such as semantic segmentation. The scheme performs supervised patch-level contrastive learning, selecting the patches based on the ground truth mask, subsequently used for hard-negative and hard-positive sampling. The scheme applies to all vision-transformer architectures, is easy to implement, and introduces minimal additional memory footprint. Additionally, the scheme removes the need for huge batch sizes, as each patch is treated as an image. We apply and test Contrastive Transformer for the case of aerial image segmentation, known for low-resolution data, large class imbalance, and similar semantic classes. We perform extensive experiments to show the efficacy of the Contrastive Transformer scheme on the ISPRS Potsdam aerial image segmentation dataset. Additionally, we show the generalizability of our scheme by applying it to multiple inherently different Transformer architectures. Ultimately, the results show a consistent increase in mean IoU across all classes.

Paper Structure (10 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: CT uses the innate patch representations from each encoder stage and calculates the contrastive loss between positive and negative samples using the ground truth mask.
  • Figure 2: The color-coded squares in the patch mask represent patches with a homogeneous class distribution, while the white squares represent patches with a mixed class distribution that includes the target class. Using the ground truth mask, the scheme determines which feature representations to use as positive and negative samples. Positive feature patches contain only the target class. In contrast, negative patches may contain a mixture of classes or a single class, as long as they exclude the target class. Patches with a mixture of classes that includes the target class are discarded. Ultimately, the selected positive and negative patches contribute to the contrastive loss function, which pulls the representations of the positive patches closer together and pushes the negative patch representations away.
  • Figure 3: Qualitative comparison for all models, depicting the enhanced representations from the CT learning scheme. The areas of interest are highlighted with red circles, showing examples where the difference between baseline and CT is prominent. All visual samples for the comparison were taken from the best of three distinct runs.
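
The patch selection rule described in the Figure 2 caption can be sketched in a few lines of plain Python. This is an illustrative sketch and not the authors' implementation: the function names (classify_patches, info_nce), the toy mask, the use of cosine similarity inside the loss, and the temperature of 0.1 are all assumptions made for this example. The loss follows the standard InfoNCE form for one anchor, one positive, and a set of negatives; the paper's exact formulation may differ.

```python
import math

def classify_patches(mask, patch, target):
    """Label each patch-sized tile of a 2-D class mask (hypothetical helper).

    'pos'  : the tile contains only the target class
    'neg'  : the tile contains no target pixels (one class or a mixture)
    'skip' : the tile mixes the target class with other classes
    Assumes the mask height and width are divisible by `patch`.
    """
    grid = []
    for r in range(0, len(mask), patch):
        row_labels = []
        for c in range(0, len(mask[0]), patch):
            classes = {mask[i][j]
                       for i in range(r, r + patch)
                       for j in range(c, c + patch)}
            if classes == {target}:
                row_labels.append("pos")
            elif target not in classes:
                row_labels.append("neg")
            else:
                row_labels.append("skip")
        grid.append(row_labels)
    return grid

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss for one anchor (temperature is an assumed value)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy 4x4 ground truth mask with classes 0, 1, 2; the target class is 1.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 2],
    [1, 0, 2, 2],
    [1, 1, 2, 2],
]
labels = classify_patches(mask, patch=2, target=1)
# labels == [["pos", "neg"], ["skip", "neg"]]

# The loss is small when the anchor matches its positive and large otherwise.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In the toy mask, the top-right tile mixes classes 0 and 2 and still counts as a negative because it excludes the target class, while the bottom-left tile is skipped because it mixes the target class with class 0. In an actual model, the vectors passed to the loss would be the patch feature representations from an encoder stage at the positions labeled 'pos' and 'neg'.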