
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation

Soojin Jang, Jungmin Yun, Junehyoung Kwon, Eunju Lee, Youngbin Kim

TL;DR

DALNet (Dense Alignment Learning Network) leverages text embeddings to improve both the comprehensive understanding and the precise localization of objects across different levels of granularity. As a single-stage method, it also enables a more efficient end-to-end process.

Abstract

Weakly supervised semantic segmentation (WSSS) approaches typically rely on class activation maps (CAMs) for initial seed generation, but CAMs often fail to capture global context because image-level labels provide only limited supervision. To address this issue, we introduce DALNet (Dense Alignment Learning Network), which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity. Our key insight is to employ a dual-level alignment strategy: (1) Global Implicit Alignment (GIA), which captures global semantics by maximizing the similarity between the class token and the corresponding text embeddings while minimizing its similarity with background embeddings, and (2) Local Explicit Alignment (LEA), which improves object localization by utilizing spatial information from patch tokens. Moreover, we propose a cross-contrastive learning approach that aligns foreground features between the image and text modalities while separating them from the background, encouraging activation in missing regions and suppressing distractions. Through extensive experiments on the PASCAL VOC and MS COCO datasets, we demonstrate that DALNet significantly outperforms state-of-the-art WSSS methods. In particular, as a single-stage method, our approach allows for a more efficient end-to-end process.
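
The idea behind GIA described in the abstract (pull the class token toward the text embeddings of the classes present, push it away from background embeddings) can be illustrated with a small contrastive loss. This is only a minimal sketch under stated assumptions, not the authors' formulation: the function names `cosine` and `global_alignment_loss`, the softmax-style form of the loss, and the `temperature` value are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def global_alignment_loss(class_token, fg_text, bg_text, temperature=0.07):
    """Hypothetical stand-in for a GIA-style loss (assumed form).

    class_token: image-level embedding (list of floats).
    fg_text: text embeddings of the classes present in the image.
    bg_text: text embeddings describing background concepts.
    The loss is low when the class token is similar to the foreground
    text embeddings and dissimilar to the background ones.
    """
    pos = [math.exp(cosine(class_token, t) / temperature) for t in fg_text]
    neg = [math.exp(cosine(class_token, t) / temperature) for t in bg_text]
    denom = sum(pos) + sum(neg)
    # Average negative log-probability of each foreground text embedding.
    return -sum(math.log(p / denom) for p in pos) / len(pos)


# Toy 2-D embeddings: one foreground class and one background concept.
aligned = global_alignment_loss([1.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]])
misaligned = global_alignment_loss([0.0, 1.0], [[1.0, 0.0]], [[0.0, 1.0]])
```

In this toy setting, `aligned` is close to zero while `misaligned` is large, which matches the intent of maximizing similarity with the corresponding text embeddings and minimizing it with background embeddings.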

Paper Structure (17 sections, 8 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Comparison of (Left) existing WSSS methods and (Right) proposed DALNet. (Left) Existing methods for implicit alignment depend on global image features, potentially missing local details within the image. (Right) In contrast, the proposed DALNet integrates global and local features to preserve spatial details and facilitate explicit alignment. It distinguishes between foreground and background in image patches and text, addressing various levels of granularity. The dual alignment mechanism captures diverse object regions without any pre-defined category or external model.
  • Figure 2: Overview of the proposed DALNet. This approach employs a visual encoder to extract features and an object-aware mask $M_{obj}$ to distinguish foreground and background. Text prompts are fed into a text encoder to generate embeddings for target objects and the background. Cross-contrastive learning aligns representations from both modalities, associating each token with either the foreground or background. GIA contrasts the class token with text embeddings to incorporate global information, while LEA leverages patch tokens and text embeddings for precise localization.
  • Figure 3: Visualization results of CAMs. We generate CAMs using a CNN baseline, ViT, CLIMS [xie2022clims], CLIP-ES [lin2023clip], and our proposed DALNet.
  • Figure 4: Visualization results of semantic segmentation on the PASCAL VOC and MS COCO datasets.
  • Figure 5: (Left) Visualization of initial CAMs with different loss function configurations. (Right) Visualization results for the CAMs on the PASCAL VOC dataset without and with cross-contrastive learning (CCL).
  • ...and 1 more figures
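
The patch-level side of the Figure 2 caption (each token is associated with either the foreground or the background using the object-aware mask $M_{obj}$ and the text embeddings) can be sketched in the same spirit. The code below is an illustrative assumption, not the paper's LEA or cross-contrastive loss: `local_alignment_loss`, the binary form of `obj_mask`, the two-way softmax over one foreground and one background text embedding, and the `temperature` value are hypothetical, and how $M_{obj}$ is actually computed is not stated in this excerpt.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def local_alignment_loss(patch_tokens, obj_mask, fg_text, bg_text,
                         temperature=0.07):
    """Hypothetical patch-level alignment loss (assumed form).

    patch_tokens: list of patch embeddings (one list of floats per patch).
    obj_mask: list of booleans, True where a patch is marked foreground
              (a stand-in for the object-aware mask M_obj).
    fg_text, bg_text: one foreground and one background text embedding.
    Foreground patches are pulled toward fg_text, background patches
    toward bg_text.
    """
    total = 0.0
    for token, is_fg in zip(patch_tokens, obj_mask):
        s_fg = math.exp(cosine(token, fg_text) / temperature)
        s_bg = math.exp(cosine(token, bg_text) / temperature)
        # Probability assigned to the side the mask says this patch is on.
        target = s_fg if is_fg else s_bg
        total += -math.log(target / (s_fg + s_bg))
    return total / len(patch_tokens)


# Two toy patches: the first looks like the object, the second like background.
patches = [[1.0, 0.0], [0.0, 1.0]]
consistent = local_alignment_loss(patches, [True, False], [1.0, 0.0], [0.0, 1.0])
inconsistent = local_alignment_loss(patches, [False, True], [1.0, 0.0], [0.0, 1.0])
```

With these toy inputs the loss is near zero when the mask agrees with the patch-to-text similarities and large when it disagrees, which is the behavior a token-level foreground/background alignment objective is expected to have.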