CoBra: Complementary Branch Fusing Class and Semantic Knowledge for Robust Weakly Supervised Semantic Segmentation

Woojung Han, Seil Kang, Kyobin Choo, Seong Jae Hwang

TL;DR

CoBra introduces a dual-branch framework that fuses CNN-based class-aware cues with ViT-based semantic cues to improve weakly supervised semantic segmentation. By learning Class-Aware Projection (CAP) and Semantic-Aware Projection (SAP) and applying contrastive-style losses, the model enables cross-branch guidance that mitigates CNNs’ limited semantic coverage and ViTs’ weak class specificity. The approach yields state-of-the-art seeds, masks, and segmentation on PASCAL VOC 2012 and MS COCO 2014, with ablations confirming the benefit of cross-branch losses and careful seed fusion. This work demonstrates that explicit, principled exchange of class- and semantic-knowledge between CNNs and ViTs can produce robust pseudo labels for pixel-level segmentation under image-level supervision.

Abstract

Leveraging semantically precise pseudo masks derived from image-level class knowledge for segmentation, namely image-level Weakly Supervised Semantic Segmentation (WSSS), still remains challenging. While Class Activation Maps (CAMs) using CNNs have steadily been contributing to the success of WSSS, the resulting activation maps often narrowly focus on class-specific parts (e.g., only face of human). On the other hand, recent works based on vision transformers (ViT) have shown promising results based on their self-attention mechanism to capture the semantic parts but fail in capturing complete class-specific details (e.g., entire body parts of human but also with a dog nearby). In this work, we propose Complementary Branch (CoBra), a novel dual branch framework consisting of two distinct architectures which provide valuable complementary knowledge of class (from CNN) and semantic (from ViT) to each branch. In particular, we learn Class-Aware Projection (CAP) for the CNN branch and Semantic-Aware Projection (SAP) for the ViT branch to explicitly fuse their complementary knowledge and facilitate a new type of extra patch-level supervision. Our model, through CoBra, fuses CNN and ViT's complementary outputs to create robust pseudo masks that integrate both class and semantic information effectively. Extensive experiments qualitatively and quantitatively investigate how CNN and ViT complement each other on the PASCAL VOC 2012 dataset, showing a state-of-the-art WSSS result. This includes not only the masks generated by our model, but also the segmentation results derived from utilizing these masks as pseudo labels.

Paper Structure (32 sections, 4 equations, 9 figures, 12 tables)

This paper contains 32 sections, 4 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Illustration of the novel Complementary Branch framework that synergizes the class knowledge of CNN with the semantic understanding of ViT: (a) shows the standard CNN processing focused on class knowledge; (b) depicts the standard ViT utilizing semantic knowledge; (c) presents our integrated approach, where CNN and ViT branches exchange knowledge complementarily; and (d) compares object localization maps from the CNN, ViT, and CoBra branches for various subjects (human, dog, airplane), illustrating the distinctive areas of interest each model identifies. Our model successfully utilizes complementary characteristics to localize the exact object of the correct class and its semantic parts.
  • Figure 2: Overview illustration of our model, Cross Complementary Branch (CoBra). The dual branch framework consists of the Class-Aware Knowledge branch with CNN (top) and the Semantic-Aware Knowledge branch with ViT (bottom). The input image is passed down to both branches. Class-Aware Knowledge (CAK) Branch: The CNN outputs a feature map which generates (1) CNN CAMs via $f_{CAM}$, (2) Pseudo-Labels from CNN CAMs via $\arg\max$, and (3) Class-Aware Projection (CAP) via $f_{proj}$. Semantic-Aware Knowledge (SAK) Branch: The ViT outputs $N^2$ Patch Embeddings which generate (1) ViT CAMs via $f_{CAM}$ and (2) Semantic-Aware Projection (SAP) via $f_{proj}$. We also use the Attention Maps of all $L$ layers to generate (3) Patch Affinity of size $N^2 \times N^2$. Complementary Branch Losses: Once the necessary outputs are prepared, we employ various losses: (1) $\mathcal{L}_{cls}$: The typical classification loss based on the individual classification predictions of each CNN CAM and ViT CAM. (2) $\mathcal{L}_{cam}$: The L1 loss between the CNN CAM and ViT CAM. (3) $\mathcal{L}_{sap}$: The class-aware knowledge from the pseudo labels guides SAP to identify more accurate class-specific patches. (4) $\mathcal{L}_{cap}$: The semantic-aware knowledge from the patch affinity improves the semantic sensitivity of ViT CAM.
  • Figure 3: Illustration of refining CAP and SAP using the SAK and CAK branches, respectively. (I) Class-Aware Knowledge (CAK): The CAP values are embedded in the Class Feature Space. (1) The CNN CAM shows that the false negative patches have been weakly localized as horse. (2) The patch affinity from the SAK branch assigns the positive (green), negative (red), and neutral (teal) patches based on the target (white) patch. (3) The CAP loss (Eq. \ref{CAPloss}) pulls those weakly localized patches (i.e., false class negatives) closer, since they are assigned as semantically positive patches based on the SAK branch. (4) The CAP is refined to improve the CNN CAM, showing fewer false class negatives. (II) Semantic-Aware Knowledge (SAK): The SAP values are embedded in the Semantic Feature Space. (1) The ViT CAM shows that the negative patches have been incorrectly localized as horse. (2) The CNN CAM from the CAK branch assigns the positive (green), negative (red), and neutral (teal) patches based on the target (white) patch. (3) The SAP loss (Eq. \ref{SAPloss}) pushes away those incorrectly localized patches (i.e., false class positives), since they are assigned as negative patches based on the CAK branch. (4) The SAP is refined to improve the ViT CAM, showing fewer false class positives.
  • Figure 4: Qualitative results. From left: (1) Input image, (2) Our result, (3) CNN CAM of our model, (4) Ours without $\mathcal{L}_{sap}$, (5) ViT CAM of our model, (6) Ours without $\mathcal{L}_{cap}$, (7) Our pseudo mask for segmentation, and (8) Ground-truth segmentation label. Our results differentiate between classes while finding accurate object boundaries.
  • Figure 5: Qualitative segmentation results on the PASCAL VOC val set.
  • ...and 4 more figures
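
The captions for Figures 2 and 3 describe several building blocks: pseudo labels taken by argmax over a CAM, an L1 loss between the CNN CAM and the ViT CAM, and positive/negative/neutral patch assignments that drive a pull/push loss on the projections. The sketch below illustrates these ideas in plain Python under stated assumptions. It is not the authors' implementation: the paper's CAP and SAP loss equations are not reproduced in this listing, so `contrastive_loss` is a generic InfoNCE-style stand-in, and the function names, the thresholds `hi` and `lo`, and the `temperature` value are all hypothetical.

```python
import math


def pseudo_labels(cam):
    """Per-patch argmax over class scores; cam[p][c] is the score of class c at patch p."""
    return [max(range(len(scores)), key=scores.__getitem__) for scores in cam]


def l1_cam_loss(cnn_cam, vit_cam):
    """Mean absolute difference between two CAMs of identical shape."""
    total = sum(abs(a - b)
                for row_a, row_b in zip(cnn_cam, vit_cam)
                for a, b in zip(row_a, row_b))
    count = sum(len(row) for row in cnn_cam)
    return total / count


def assign_patches(affinity_row, target, hi=0.7, lo=0.3):
    """Split patches into positives and negatives relative to a target patch.

    Patches with affinity strictly between lo and hi are neutral and ignored.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    pos = [j for j, a in enumerate(affinity_row) if j != target and a >= hi]
    neg = [j for j, a in enumerate(affinity_row) if j != target and a <= lo]
    return pos, neg


def cosine(u, v):
    """Cosine similarity between two vectors, with a small epsilon for stability."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(proj, target, pos, neg, temperature=0.5):
    """InfoNCE-style stand-in: low when positives are close to the target
    projection and negatives are far from it. Not the paper's exact loss."""
    if not pos:
        return 0.0
    p = sum(math.exp(cosine(proj[target], proj[j]) / temperature) for j in pos)
    n = sum(math.exp(cosine(proj[target], proj[j]) / temperature) for j in neg)
    return -math.log(p / (p + n))


# Toy example with 4 patches and 2-dimensional projections.
proj = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
affinity_row = [1.0, 0.9, 0.5, 0.1]  # affinity of every patch to patch 0
pos, neg = assign_patches(affinity_row, target=0)
loss = contrastive_loss(proj, target=0, pos=pos, neg=neg)
```

In this toy setup, patch 1 is assigned as positive, patch 3 as negative, and patch 2 as neutral. Because the positive projection already points in nearly the same direction as the target and the negative points away from it, the loss is small; swapping the two assignments makes it much larger, which is the pull/push behaviour the Figure 3 caption describes.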