Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision

Yajie Liu, Pu Ge, Qingjie Liu, Di Huang

TL;DR

This paper introduces a Multi-Grained Cross-modal Alignment (MGCA) framework that explicitly learns pixel-level alignment alongside object- and region-level alignment, bridging the granularity gap without any dense annotations.

Abstract

Recently, learning open-vocabulary semantic segmentation from text supervision has achieved promising downstream performance. Nevertheless, current approaches encounter an alignment granularity gap owing to the absence of dense annotations, wherein they learn coarse image/region-text alignment during training yet perform group/pixel-level predictions at inference. Such discrepancy leads to suboptimal learning efficiency and inferior zero-shot segmentation results. In this paper, we introduce a Multi-Grained Cross-modal Alignment (MGCA) framework, which explicitly learns pixel-level alignment along with object- and region-level alignment to bridge the granularity gap without any dense annotations. Specifically, MGCA ingeniously constructs pseudo multi-granular semantic correspondences upon image-text pairs and collaborates with hard sampling strategies to facilitate fine-grained cross-modal contrastive learning. Further, we point out the defects of existing group and pixel prediction units in downstream segmentation and develop an adaptive semantic unit which effectively mitigates their dilemmas including under- and over-segmentation. Training solely on CC3M, our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.

Paper Structure (12 sections, 8 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Conceptual comparison between previous methods and ours. Group-based methods encounter an image-group alignment gap while pixel-wise methods confront a region-pixel alignment gap. In contrast, our method bridges the train-test alignment gap and introduces adaptive prediction that mitigates under-/over-segmentation issues.
  • Figure 2: The architecture and overall pipeline of our method. $E_v$ and $E_t$ are initialized with CLIP and frozen. We train the decoder $D_v$ only. As illustrated by the provided examples, Multi-Grained Cross-modal Alignment (MGCA) innovatively constructs object/region/pixel-level semantic correspondence, which enables the model to learn fine-grained alignment without dense annotations. During inference, we discard MGCA and aggregate pixel embeddings into adaptive semantic units for predictions.
  • Figure 3: Illustration of the proposed Multi-Grained Cross-modal Alignment (MGCA). Based on the pixel-to-text similarity matrix $S_{ij}$, we identify informative positive and negative pairs for object-, region- and pixel-level contrastive learning.
  • Figure 4: Examples of our semantic units. Pixels with the same color belong to the same unit. Our semantic units align with part-level representations, such as the wheel hub in the first image and the hat and shoes in the second image.
  • Figure 5: Visualization of the impact of each module. The baseline corresponds to the first row in \ref{tab: grained}, which directly employs CLIP for dense prediction. We progressively integrate object-, region- and pixel-level alignment modules, along with the proposed unit, into the model to qualitatively demonstrate the impact of each module.
  • ...and 3 more figures
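
The caption of Figure 3 describes picking informative positive and negative pairs from a pixel-to-text similarity matrix. The sketch below shows one way such a selection could look on toy data; it is not the authors' implementation. It assumes that rows index pixels and columns index texts, that similarity is cosine similarity, and that "informative" means the top-k most similar pixels. The helper names (`similarity_matrix`, `top_k_pixels`) and the toy embeddings are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def similarity_matrix(pixel_embs, text_embs):
    """Cosine similarity; rows index pixels, columns index texts (assumed layout)."""
    pixels = [l2_normalize(p) for p in pixel_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(p, t)) for t in texts] for p in pixels]


def top_k_pixels(sim, text_idx, k):
    """Indices of the k pixels most similar to one text, most similar first."""
    order = sorted(range(len(sim)), key=lambda i: sim[i][text_idx], reverse=True)
    return order[:k]


# Toy data: four 2-D pixel embeddings and two text embeddings.
# Text 0 plays the role of the paired caption, text 1 an unpaired caption.
pixel_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]

sim = similarity_matrix(pixel_embs, text_embs)

# Assumed heuristic: pixels closest to the paired text act as positives,
# and pixels closest to an unpaired text act as hard negatives.
positives = top_k_pixels(sim, text_idx=0, k=2)
hard_negatives = top_k_pixels(sim, text_idx=1, k=1)
```

In a real pipeline the embeddings would come from the frozen CLIP encoders and the trained decoder mentioned in the caption of Figure 2, and the selected pairs would feed a contrastive loss. The selection rule shown here is only a stand-in for the paper's hard sampling strategies.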