Contextrast: Contextual Contrastive Learning for Semantic Segmentation

Changki Sung, Wanhee Kim, Jungho An, Wooju Lee, Hyungtae Lim, Hyun Myung

TL;DR

The proposed Contextrast substantially enhances the performance of semantic segmentation networks, outperforming state-of-the-art contrastive learning approaches on diverse public datasets, e.g., Cityscapes, CamVid, PASCAL-C, COCO-Stuff, and ADE20K, without an increase in computational cost during inference.

Abstract

Despite great improvements in semantic segmentation, challenges persist because of the lack of local/global contexts and the relationship between them. In this paper, we propose Contextrast, a contrastive learning-based semantic segmentation method that captures local/global contexts and comprehends their relationships. Our proposed method comprises two parts: a) contextual contrastive learning (CCL) and b) boundary-aware negative (BANE) sampling. Contextual contrastive learning obtains local/global context from multi-scale feature aggregation and the inter/intra-relationships of features for better discrimination capabilities. Meanwhile, BANE sampling selects embedding features along the boundaries of incorrectly predicted regions and employs them as harder negative samples in our contrastive learning, resolving segmentation issues along the boundary region by exploiting fine-grained details. We demonstrate that Contextrast substantially enhances the performance of semantic segmentation networks, outperforming state-of-the-art contrastive learning approaches on diverse public datasets, e.g., Cityscapes, CamVid, PASCAL-C, COCO-Stuff, and ADE20K, without an increase in computational cost during inference.
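
The BANE sampling idea described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation and rests on several assumptions: the error map for class n is taken to mark pixels predicted as n whose label is a different class; the Distance Transform named in the Figure 3 caption is swapped for a simpler 4-connected (city-block) breadth-first distance; and the function names `error_map`, `distance_map`, and `bane_sample` are hypothetical.

```python
from collections import deque
import math


def error_map(pred, gt, n):
    """Binary map: 1 where class n is predicted but the label differs.
    (Assumed definition of an incorrectly predicted region for class n.)"""
    return [[1 if p == n and g != n else 0 for p, g in zip(pr, gr)]
            for pr, gr in zip(pred, gt)]


def distance_map(binary):
    """City-block distance from each error pixel to the nearest non-error pixel.
    Stand-in for a Euclidean distance transform; computed by multi-source BFS."""
    h, w = len(binary), len(binary[0])
    dist = [[0 if binary[r][c] == 0 else math.inf for c in range(w)]
            for r in range(h)]
    queue = deque((r, c) for r in range(h) for c in range(w)
                  if binary[r][c] == 0)
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and dist[rr][cc] == math.inf:
                dist[rr][cc] = dist[r][c] + 1
                queue.append((rr, cc))
    return dist


def bane_sample(pred, gt, n, k):
    """Return up to k error-pixel coordinates for class n,
    ordered so that pixels nearest the error-region boundary come first."""
    binary = error_map(pred, gt, n)
    dist = distance_map(binary)
    errors = [(dist[r][c], r, c)
              for r in range(len(binary)) for c in range(len(binary[0]))
              if binary[r][c] == 1]
    errors.sort()
    return [(r, c) for _, r, c in errors[:k]]


# Toy example: a 3x3 block wrongly predicted as class 1 inside a 5x5 image.
gt = [[0] * 5 for _ in range(5)]
pred = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
        for r in range(5)]
hard_negatives = bane_sample(pred, gt, n=1, k=8)
```

In this toy example the eight selected pixels form the outer ring of the wrongly predicted block, and the interior pixel at (2, 2) is selected only if k is raised to 9. The embedding vectors at the selected coordinates would then serve as negatives in the contrastive loss.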

Paper Structure

This paper contains 14 sections, 8 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: (a) Ground truth, (b) output of HRNet sun2019high, and (c) output of ours. (d) Overview of our contextual contrastive learning (CCL): the representative anchors of the last layer, which come from the higher-level embedding space, are aggregated into the representative anchors of the lower layer to encapsulate local and global context. By doing so, the anchor of the $n$-th class on the $i$-th layer $\mathbf{a}^n_i$ is updated to $\hat{\mathbf{a}}^n_i$ (on the right side, its position is shifted), enhancing the distinctiveness between the anchors of each class. (e) Visual description of our boundary-aware negative (BANE) sampling (red-filled triangles and red-bordered triangles). Our sampling prioritizes the features of incorrect predictions at the edges (red-filled triangles) over those inside the regions (red-bordered triangles) as negative samples. Each shape represents an embedding vector derived from the respective class (best viewed in color).
  • Figure 2: Overall Contextrast framework. Contextrast utilizes representative anchors updated by the semantically rich representative anchor vector set $\mathbf{A}_I$, and thus integrates local/global contexts and their relationships. BANE sampling then selects examples along the boundaries of prediction error regions, which yields more informative negative samples and captures fine-grained details for contrastive learning. $\mathbf{I}_\text{Batch}$ is the batch of images. $\hat{\mathbf{Y}}$ is the prediction outcome from the model. $\mathbf{F}_\mathit{i}$ is the feature map of the $\mathit{i}$-th encoder layer. $\mathbf{V}_\mathit{i}$ is the $\mathit{i}$-th set of embedded feature vectors produced by the encoding function $\pi(\cdot)$. $\mathbf{A}_\mathit{i}$ denotes the representative anchors of the $\mathit{i}$-th set of embedded feature vectors. The updated representative anchor $\hat{\mathbf{A}}_{\mathit{i}}$ results from adding low-level and highest-level anchors. $w_h$ and $w_l$ are weight hyperparameters for updating representative anchors. $L_{\text{PA}}$ is the proposed pixel-anchor loss function. $L_{\text{CE}}$ represents the cross-entropy loss function. Features of each semantic class are illustrated in different shapes and colors (best viewed in color).
  • Figure 3: Visual description of boundary-aware negative sampling and how the under/over-segmentation problems are addressed during training. (a) The prediction outcome $\hat{\mathbf{Y}}$ is decomposed into class-wise binary maps $\textbf{B}^\mathit{n}_\mathit{i}$. Then, class-wise distance maps $\textbf{D}^\mathit{n}_\mathit{i}$ are generated with the Distance Transform kimmel1996sub. (b) The evolution of the distance map over iterations. The wrongly predicted regions shrink during training (best viewed in color).
  • Figure 4: Qualitative results from HRNet sun2019high, HRNet + pissas2022multi, and HRNet + Ours on the Cityscapes, ADE20K, and COCO-Stuff datasets, respectively (best viewed in color).
  • Figure 5: Average cosine similarity between error pixels and the representative anchors in each layer, computed as a function of the distance from incorrect prediction boundaries. The results demonstrate that samples located along the incorrect prediction boundaries are harder negative samples than features in the inner region.
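
The anchor update described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: it assumes that a representative anchor is the class-wise mean of the embedded feature vectors, and that "adding low-level and highest-level anchors" means the weighted sum $\hat{\mathbf{a}}^n_i = w_l \mathbf{a}^n_i + w_h \mathbf{a}^n_I$. The function names and the weight values are hypothetical.

```python
from collections import defaultdict


def class_anchors(vectors, labels):
    """Mean embedding per class (assumed definition of a representative anchor)."""
    sums, counts = {}, defaultdict(int)
    for v, n in zip(vectors, labels):
        if n not in sums:
            sums[n] = [0.0] * len(v)
        sums[n] = [s + x for s, x in zip(sums[n], v)]
        counts[n] += 1
    return {n: [s / counts[n] for s in sums[n]] for n in sums}


def update_anchors(anchors_i, anchors_last, w_l, w_h):
    """Assumed update: a_hat_i^n = w_l * a_i^n + w_h * a_I^n,
    applied to every class n that has an anchor in both layers."""
    return {
        n: [w_l * lo + w_h * hi
            for lo, hi in zip(anchors_i[n], anchors_last[n])]
        for n in anchors_i if n in anchors_last
    }


# Toy example: three 2-D embeddings from layer i, two classes.
V_i = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]
A_i = class_anchors(V_i, labels)           # lower-layer anchors
A_I = {0: [0.0, 4.0], 1: [4.0, 0.0]}       # last-layer anchors (made up)
A_hat_i = update_anchors(A_i, A_I, w_l=0.5, w_h=0.5)  # weights are placeholders
```

Because the update only mixes anchors that are already computed during training, a sketch like this is consistent with the claim that inference cost does not increase: none of these steps is needed at test time.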