Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision

Yajie Liu; Pu Ge; Qingjie Liu; Di Huang

TL;DR

This paper introduces the Multi-Grained Cross-modal Alignment (MGCA) framework, which explicitly learns pixel-level alignment alongside object- and region-level alignment to bridge the granularity gap without any dense annotations.

Abstract

Recently, learning open-vocabulary semantic segmentation from text supervision has achieved promising downstream performance. Nevertheless, current approaches encounter an alignment granularity gap owing to the absence of dense annotations, wherein they learn coarse image/region-text alignment during training yet perform group/pixel-level predictions at inference. Such discrepancy leads to suboptimal learning efficiency and inferior zero-shot segmentation results. In this paper, we introduce a Multi-Grained Cross-modal Alignment (MGCA) framework, which explicitly learns pixel-level alignment along with object- and region-level alignment to bridge the granularity gap without any dense annotations. Specifically, MGCA ingeniously constructs pseudo multi-granular semantic correspondences upon image-text pairs and collaborates with hard sampling strategies to facilitate fine-grained cross-modal contrastive learning. Further, we point out the defects of existing group and pixel prediction units in downstream segmentation and develop an adaptive semantic unit which effectively mitigates their dilemmas including under- and over-segmentation. Training solely on CC3M, our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.
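
To make the phrase "cross-modal contrastive learning" concrete, here is a minimal sketch of a generic InfoNCE-style loss in the visual-to-text direction. It is not the MGCA objective: the abstract does not give the loss, the hard sampling rule, or the temperature, so the function names, the temperature value of 0.07, and the batch-level pairing below are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(visual, text, temperature=0.07):
    """Generic visual-to-text InfoNCE loss (illustrative, not the paper's loss).

    visual[i] and text[i] are treated as a positive pair; every other
    text[j] in the batch acts as a negative for visual[i].
    The temperature value is an assumption, not taken from the paper.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    n = len(v)
    total = 0.0
    for i in range(n):
        # Temperature-scaled cosine similarity of visual[i] to every text.
        row = [sum(a * b for a, b in zip(v[i], t[j])) / temperature
               for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(row)
        log_denom = m + math.log(sum(math.exp(x - m) for x in row))
        # Cross-entropy with the matching text as the target class.
        total += log_denom - row[i]
    return total / n


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss close to zero, while swapping the texts gives a much larger loss, which is the signal a contrastive objective uses to pull matching visual and text embeddings together.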

Paper Structure (12 sections, 8 equations, 8 figures, 4 tables)

Introduction
Related Work
Method
Problem Definition
Multi-Grained Cross-Modal Alignment
Adaptive semantic unit
Experiments
Implementation details
Zero-shot Transfer to Semantic Segmentation
Ablation Studies
Visualization
Conclusion

Figures (8)

Figure 1: Conceptual comparison between previous methods and ours. Group-based methods encounter an image-group alignment gap while pixel-wise methods confront a region-pixel alignment gap. In contrast, our method bridges the train-test alignment gap and introduces adaptive prediction that mitigates under-/over-segmentation issues.
Figure 2: The architecture and overall pipeline of our method. $E_v$ and $E_t$ are initialized with CLIP and frozen. We train the decoder $D_v$ only. As illustrated by the provided examples, Multi-Grained Cross-modal Alignment (MGCA) innovatively constructs object/region/pixel-level semantic correspondence, which enables the model to learn fine-grained alignment without dense annotations. During inference, we discard MGCA and aggregate pixel embeddings into adaptive semantic units for predictions.
Figure 3: Illustration of the proposed Multi-Grained Cross-modal Alignment (MGCA). Based on the pixel-to-text similarity matrix $S_{ij}$, we identify informative positive and negative pairs for object-, region- and pixel-level contrastive learning.
Figure 4: Examples of our semantic units. Pixels with the same color belong to the same unit. Our semantic units align with part-level representations, such as the wheel hub in the first image and the hat and shoes in the second image.
Figure 5: Visualization of the impact of each module. The baseline corresponds to the first row of the ablation table (label "tab: grained"), which directly employs CLIP for dense prediction. We progressively integrate object-, region- and pixel-level alignment modules, along with the proposed unit, into the model to qualitatively demonstrate the impact of each module.
...and 3 more figures
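
The pixel-to-text similarity matrix $S_{ij}$ mentioned in the Figure 3 caption can be illustrated with a small sketch. The captions do not define how $S_{ij}$ is computed or how positives and negatives are chosen from it, so the code below assumes plain cosine similarity between pixel embedding $i$ and text embedding $j$, followed by a per-pixel argmax. The function names are hypothetical, and the adaptive semantic units used at inference are not modeled here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pixel_text_similarity(pixel_embeddings, text_embeddings):
    """Return S where S[i][j] is the similarity of pixel i to text j.

    Assumption: cosine similarity; the source does not state the measure.
    """
    return [[cosine(p, t) for t in text_embeddings] for p in pixel_embeddings]


def assign_labels(similarity):
    """Assign each pixel the index of its most similar text (per-row argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity]


# Three toy "pixels" and two toy "class name" embeddings.
pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]

S = pixel_text_similarity(pixels, texts)
labels = assign_labels(S)
```

Here the first two pixels are assigned to text 0 and the last pixel to text 1. Per-pixel argmax is the simplest prediction rule; the paper argues that predicting on adaptive semantic units instead mitigates under- and over-segmentation.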
