Multi-Modal Prototypes for Open-World Semantic Segmentation

Yuhuan Yang, Chaofan Ma, Chen Ju, Fei Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang

TL;DR

Open-world semantic segmentation requires identifying both seen and unseen categories at inference. The authors introduce a multi-modal prototype framework that constructs multiple visual prototypes from region-aware visual aggregation and multiple textual prototypes through decomposed granular descriptions, fused via a cross-attention mechanism and an elastic mask prediction module. Key innovations include M-Splitting for fast region-based visual prototypes, LLM-driven textual decomposition with CLIP embeddings, and a complementary fusion strategy that leverages both modalities for robust segmentation across zero-shot, few-shot, and generalized few-shot settings. Empirical results on PASCAL-$5^i$ and COCO-$20^i$ demonstrate state-of-the-art performance and strong generalization to unseen classes, highlighting the practical potential of multi-modal prototypes in open-world vision tasks.

Abstract

In semantic segmentation, generalizing a visual system to both seen and novel categories at inference time has always been practically valuable yet challenging. To enable such functionality, existing methods mainly rely on either providing several support demonstrations from the visual aspect or characterizing informative clues from the textual aspect (e.g., the class names). Nevertheless, both lines of work neglect the intrinsic complementarity of low-level visual and high-level language information, and explorations that treat the visual and textual modalities as a whole to improve predictions remain limited. To close this gap, we propose to encompass textual and visual clues as multi-modal prototypes to allow more comprehensive support for open-world semantic segmentation, and build a novel prototype-based segmentation framework to realize this promise. Specifically, unlike a straightforward combination of bi-modal clues, we decompose the high-level language information into multi-aspect prototypes and aggregate the low-level visual information into more semantic prototypes, on the basis of which a fine-grained complementary fusion makes the multi-modal prototypes more powerful and accurate for prediction. Based on an elastic mask prediction module that permits any number and form of prototype inputs, we are able to solve the zero-shot, few-shot, and generalized counterpart tasks in one architecture. Extensive experiments on both the PASCAL-$5^i$ and COCO-$20^i$ datasets show the consistent superiority of the proposed method over previous state-of-the-art approaches, and a range of ablation studies thoroughly dissect each component of our framework, both quantitatively and qualitatively, verifying their effectiveness.
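The idea of a prediction step that accepts any number of prototypes can be illustrated with a small sketch. This is not the paper's module: the cosine similarity, the softmax over all prototypes, the `temperature` value, and the names `cosine` and `predict_pixel` are assumptions chosen for illustration. Each prototype acts as an independent classifier for one pixel feature, the prototypes compete through a softmax, and probabilities of prototypes that share a class are summed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def predict_pixel(feature, prototypes, temperature=0.1):
    """Score one pixel feature against a variable-length list of prototypes.

    prototypes: list of (class_name, vector) pairs; several pairs may share
    a class name. The temperature is an illustrative assumption.
    Returns a dict mapping each class name to its summed probability.
    """
    scores = [cosine(feature, vec) / temperature for _, vec in prototypes]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    class_prob = {}
    for (name, _), weight in zip(prototypes, weights):
        # Prototypes of the same class are grouped by summing their shares.
        class_prob[name] = class_prob.get(name, 0.0) + weight / total
    return class_prob


# One background prototype and two prototypes for the same foreground class.
prototypes = [
    ("background", [0.0, 1.0]),
    ("cat", [1.0, 0.0]),
    ("cat", [0.9, 0.1]),
]
probs = predict_pixel([1.0, 0.05], prototypes)
label = max(probs, key=probs.get)
```

Because the function only iterates over whatever list it receives, the same call works with a single textual prototype per class (a zero-shot style input) or with many visual prototypes per class (a few-shot style input).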

Paper Structure (34 sections, 13 equations, 9 figures, 11 tables, 2 algorithms)

Figures (9)

  • Figure 1: (A) Single-prototype-based paradigm. The model learns a single prototype from uni-modal information and uses it as a semantic indicator for segmentation tasks. (B) Straightforward combination. Straightforwardly combining the two modalities through prototype addition is ineffective. (C) Multi-modal-prototype-based segmentation framework. Multiple prototypes are obtained through visual aggregation and textual decomposition, followed by complementary fusion to acquire multi-modal prototypes.
  • Figure 2: Framework overview. (A) Textual prototypes through decomposition: to enrich the context and eliminate ambiguity, we decompose the semantics of class names into fine-grained descriptions using LLMs. (B) Visual prototypes through aggregation: we split the mask into regions and aggregate features accordingly to establish multiple inherently consistent prototypes. (C) Fusing multi-modal prototypes: to learn powerful multi-modal prototypes, we design a complementary fusion module that effectively mediates the relevance between prototypes. (D) Mask prediction: we design a comprehensive mask calculation module that permits any number and form of prototype inputs.
  • Figure 3: Visual prototypes through aggregation. We split the support mask into several regions using the M-Splitting algorithm and average the visual features within each region, forming several tokens that serve as visual prototypes.
  • Figure 4: Multiple-prototype-based mask prediction pipeline. Different prototypes are treated as independent classifiers and compete with each other through an attention mechanism. Prototypes of the same class share the same $\mathbf{V}$ and can be grouped together during the attention process.
  • Figure 5: An illustration of Eq. \ref{eq:multi_level}. The multi-level predictions are fused one by one to obtain the final prediction.
  • ...and 4 more figures
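
The aggregation step described in the Figure 3 caption, averaging visual features within each region of a split support mask, can be sketched as follows. The region map is taken as a given input here; how M-Splitting produces it is not specified in this summary, so the `region_ids` layout (0 for background, positive integers for regions) and the function name `region_prototypes` are hypothetical.

```python
from collections import defaultdict


def region_prototypes(features, region_ids):
    """Average per-pixel feature vectors within each region.

    features:   H x W grid (nested lists) of C-dimensional feature vectors.
    region_ids: H x W grid of ints; 0 marks pixels outside the support mask,
                k > 0 marks the k-th region (an assumed encoding).
    Returns a dict {region_id: mean feature vector}, one prototype per region.
    """
    sums = {}
    counts = defaultdict(int)
    for feat_row, id_row in zip(features, region_ids):
        for feat, rid in zip(feat_row, id_row):
            if rid == 0:
                continue  # background pixels contribute to no prototype
            if rid not in sums:
                sums[rid] = [0.0] * len(feat)
            for channel, value in enumerate(feat):
                sums[rid][channel] += value
            counts[rid] += 1
    return {rid: [s / counts[rid] for s in total] for rid, total in sums.items()}


# Toy 2 x 3 feature map with 2 channels, split into two regions.
features = [
    [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]],
    [[9.0, 9.0], [0.0, 4.0], [0.0, 6.0]],
]
region_ids = [
    [1, 1, 2],
    [0, 2, 2],
]
prototypes = region_prototypes(features, region_ids)
```

Each region yields one token, so a mask split into more regions produces more visual prototypes without changing the code.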