OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation
Kwanyoung Kim, Yujin Oh, Jong Chul Ye
TL;DR
OTSeg addresses zero-shot semantic segmentation (ZS3) by formulating text-pixel alignment as an Optimal Transport (OT) problem over multiple text prompts. It introduces Multi-Prompts Sinkhorn (MPS) to transport distributions from text prompts to image pixels, and extends it to Multi-Prompts Sinkhorn Attention (MPSA), which integrates cross-modal matching inside a Transformer decoder, with an ensemble at inference (OTSeg+). The approach uses a frozen CLIP text encoder, a tunable image encoder, and a relationship descriptor that refines the text embeddings. It reports state-of-the-art results on VOC 2012, PASCAL Context, and COCO-Stuff164K in both inductive and transductive ZS3 settings, along with robustness and favorable efficiency. The results suggest that OT-based multi-prompt alignment is a promising direction for open-vocabulary segmentation and for broader multimodal vision-language tasks.
Abstract
The recent success of CLIP has demonstrated promising results in zero-shot semantic segmentation by transferring multimodal knowledge to pixel-level classification. However, leveraging pre-trained CLIP knowledge to closely align text embeddings with pixel embeddings still has limitations in existing approaches. To address this issue, we propose OTSeg, a novel multimodal attention mechanism aimed at enhancing the potential of multiple text prompts for matching associated pixel embeddings. We first propose Multi-Prompts Sinkhorn (MPS) based on the Optimal Transport (OT) algorithm, which leads multiple text prompts to selectively focus on various semantic features within image pixels. Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce the extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA), which effectively replaces cross-attention mechanisms within the Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
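
To make the Optimal Transport idea behind MPS more concrete, the following is a minimal sketch of generic Sinkhorn iterations that compute a transport plan between a few text-prompt embeddings and a few pixel embeddings. It is not the authors' implementation: the cost (one minus cosine similarity), the uniform marginals, the regularization strength `eps`, the iteration count, and the name `sinkhorn_plan` are all assumptions chosen for illustration, and random vectors stand in for CLIP features.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.5, n_iters=200):
    """Entropy-regularized OT plan between rows (prompts) and columns (pixels).

    Assumes uniform marginals on both sides; this is an illustrative choice,
    not something stated in the abstract.
    """
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)      # mass assigned to each text prompt
    nu = np.full(m, 1.0 / m)      # mass assigned to each pixel
    K = np.exp(-cost / eps)       # Gibbs kernel derived from the cost
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = mu / (K @ v)          # rescale rows toward the prompt marginal
        v = nu / (K.T @ u)        # rescale columns toward the pixel marginal
    return u[:, None] * K * v[None, :]


# Toy stand-ins for embeddings: 4 prompts and 6 pixels in an 8-dim space.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
pixels = rng.normal(size=(6, 8))
text /= np.linalg.norm(text, axis=1, keepdims=True)
pixels /= np.linalg.norm(pixels, axis=1, keepdims=True)

cost = 1.0 - text @ pixels.T      # assumed cost: 1 - cosine similarity
plan = sinkhorn_plan(cost)        # shape (4, 6); entries sum to 1
```

Each row of `plan` shows how one prompt spreads its mass over the pixels. Because every prompt must account for the same total mass, the prompts are pushed toward covering different pixels rather than all collapsing onto the same ones, which is the intuition behind having multiple prompts focus on different semantic features.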
