OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation

Kwanyoung Kim, Yujin Oh, Jong Chul Ye

TL;DR

OTSeg addresses zero-shot semantic segmentation by formulating text-pixel alignment as an Optimal Transport problem and enriching it with multiple text prompts. It introduces Multi-Prompts Sinkhorn (MPS) to transport distributions from text prompts to image pixels and extends this with Multi-Prompts Sinkhorn Attention (MPSA) to integrate cross-modal matching inside a Transformer decoder, with an ensemble at inference (OTSeg+). The approach leverages a frozen CLIP text encoder with a tunable image encoder and a relationship descriptor to refine text embeddings, achieving state-of-the-art results on VOC 2012, PASCAL Context, and COCO-Stuff164K in both inductive and transductive ZS3 settings, and demonstrating robustness and favorable efficiency. This work demonstrates strong multimodal alignment capabilities for open-vocabulary segmentation and suggests promising directions for extending OT-based prompts to broader multimodal vision-language tasks.

Abstract

The recent success of CLIP has demonstrated promising results in zero-shot semantic segmentation by transferring multimodal knowledge to pixel-level classification. However, existing approaches remain limited in how closely they can align text embeddings with pixel embeddings using pre-trained CLIP knowledge. To address this issue, we propose OTSeg, a novel multimodal attention mechanism aimed at enhancing the potential of multiple text prompts for matching associated pixel embeddings. We first propose Multi-Prompts Sinkhorn (MPS) based on the Optimal Transport (OT) algorithm, which leads multiple text prompts to selectively focus on various semantic features within image pixels. Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce an extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA), which effectively replaces cross-attention mechanisms within the Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
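
The Sinkhorn step underlying MPS can be illustrated with a generic entropic-regularized OT solver. The following is a minimal sketch, not the authors' implementation: it assumes a cost of one minus cosine similarity between N prompt embeddings and M pixel embeddings, uniform marginals on both sides, and hypothetical values for the regularization strength `eps` and the iteration count `n_iters`. The function names and the random embeddings are placeholders for illustration only.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.5, n_iters=200):
    """Entropic-regularized OT plan between two uniform marginals.

    cost: (N, M) array, e.g. N text prompts vs. M pixels (assumed layout).
    eps and n_iters are illustrative defaults, not values from the paper.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # mass assigned to each text prompt
    b = np.full(m, 1.0 / m)      # mass assigned to each pixel
    kernel = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)      # rescale rows toward marginal a
        v = b / (kernel.T @ u)    # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]


def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Random stand-ins for CLIP embeddings (hypothetical sizes: N=4, M=50, dim=16).
rng = np.random.default_rng(0)
prompts = l2_normalize(rng.normal(size=(4, 16)))
pixels = l2_normalize(rng.normal(size=(50, 16)))

cost = 1.0 - prompts @ pixels.T  # one minus cosine similarity
plan = sinkhorn_plan(cost)       # (4, 50) transport plan
```

Each row of `plan` shows how one prompt distributes its mass over the pixels. Because every prompt must ship the same total mass while every pixel receives the same total mass, the prompts cannot all concentrate on the same pixels, which is the intuition behind prompts selectively focusing on different features.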

Paper Structure

This paper contains 33 sections, 18 equations, 8 figures, 8 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Visualization of the proposed Multi-Prompts Sinkhorn Attention (MPSA) for text-driven semantic segmentation. (a) Without MPSA, all the text prompt-related score maps ${S}^i$ are cohered. (b) With MPSA, each ${S}^i$ selectively focuses on different semantic attributes, so that the final score map effectively attends to the target object.
  • Figure 2: Comparison of attention mechanism variants. (a) Cross-attention mechanism for multimodal settings. (b) Sinkformer self-attention mechanism for unimodal settings. (c) Our proposed Multi-Prompts Sinkhorn Attention (MPSA) for multimodal settings, which aims to optimally transport image pixels (M) to multiple text prompts (N).
  • Figure 3: Overview of OTSeg for zero-shot semantic segmentation. (a) MPS path refines the score map using the MPS algorithm. (b) Decoder path involves the decoder output, which integrates the Multi-Prompts Sinkhorn Attention (MPSA) predictions. (c) During inference, OTSeg ensembles predictions from both paths with a balancing factor $\lambda$.
  • Figure 4: Qualitative comparison with previous SOTA models on the COCO-Stuff164K dataset. Green tags indicate unseen classes, while yellow tags indicate seen classes.
  • Figure 5: Visual comparison of prompt-related score maps. Without MPSA, all the text prompt-related score maps ${S}^i$ are cohered; with our MPSA, each ${S}^i$ is diversely activated and focuses on different semantic attributes (white arrows), which helps the model effectively differentiate the target object from the background (red arrows).
  • ...and 3 more figures
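
The inference-time ensemble described in the Figure 3 caption, where predictions from the MPS path and the decoder path are combined with a balancing factor $\lambda$, can be sketched as follows. This is an illustration under stated assumptions: the source does not give the exact combination rule, so a convex combination of per-class score maps is assumed here, and the function name, array layout (classes, height, width), and toy values are hypothetical.

```python
import numpy as np


def ensemble_scores(mps_scores, decoder_scores, lam=0.5):
    """Blend two per-class score maps with a balancing factor lam.

    Assumed rule (not confirmed by the source): lam * MPS + (1 - lam) * decoder.
    Both inputs have shape (num_classes, height, width).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * mps_scores + (1.0 - lam) * decoder_scores


# Toy score maps: 2 classes over a 1x2 image (values are made up).
mps_scores = np.array([[[0.9, 0.1]], [[0.1, 0.9]]])
decoder_scores = np.array([[[0.2, 0.3]], [[0.8, 0.7]]])

fused = ensemble_scores(mps_scores, decoder_scores, lam=0.5)
labels = fused.argmax(axis=0)  # per-pixel class index
```

Setting `lam` to 0 or 1 recovers the decoder-only or MPS-only prediction, which makes it easy to compare each path against the ensemble.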