CLUSTSEG: Clustering for Universal Segmentation

James Liang, Tianfei Zhou, Dongfang Liu, Wenguan Wang

TL;DR

CLUSTSEG is a universal, transformer-based framework that unifies superpixel, semantic, instance, and panoptic segmentation by recasting segmentation as iterative clustering. It introduces task-aware Dreamy-Start initialization and a nonparametric Recurrent Cross-Attention mechanism that performs EM-like cluster updates without extra learnable parameters, which keeps the pixel clustering process transparent. Across panoptic, instance, semantic, and superpixel benchmarks, CLUSTSEG achieves state-of-the-art or competitive results, and the ablations confirm that both the initialization scheme and the recurrent clustering are critical. Because the task-specific behavior comes from initialization rather than architectural changes, the same model design applies to all four dense prediction tasks.

Abstract

We present CLUSTSEG, a general, transformer-based framework that tackles different image segmentation tasks (i.e., superpixel, semantic, instance, and panoptic) through a unified neural clustering scheme. Regarding queries as cluster centers, CLUSTSEG is innovative in two aspects: 1) cluster centers are initialized in heterogeneous ways so as to directly address task-specific demands (e.g., instance- or category-level distinctiveness), without modifying the architecture; and 2) pixel-cluster assignment, formalized in a cross-attention fashion, is alternated with cluster center update, without learning additional parameters. These innovations closely link CLUSTSEG to EM clustering and make it a transparent and powerful framework that yields superior results across the above segmentation tasks.
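
The alternation between pixel-cluster assignment and cluster center update can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `recurrent_cross_attention`, the absence of query/key/value projections, the dot-product scaling, and the normalized weighted mean used for the center update are all assumptions made for illustration. The point it shows is that the softmax runs over the clusters (each pixel distributes its membership across centers), and that repeating the two steps adds no parameters.

```python
import numpy as np


def recurrent_cross_attention(features, centers, num_iters=3):
    """EM-like clustering sketch (hypothetical helper, not the paper's code).

    features: (N, D) array of pixel embeddings.
    centers:  (K, D) array of initial cluster centers (the "queries").
    Returns the updated centers (K, D) and the soft assignment (K, N)
    computed in the final E-step.
    """
    scale = 1.0 / np.sqrt(features.shape[1])  # assumed attention-style scaling
    for _ in range(num_iters):
        # E-step: soft pixel-to-cluster assignment.
        # Softmax is taken over the K clusters (axis 0), not over pixels.
        logits = (centers @ features.T) * scale          # (K, N)
        logits -= logits.max(axis=0, keepdims=True)      # numerical stability
        assign = np.exp(logits)
        assign /= assign.sum(axis=0, keepdims=True)

        # M-step: each center becomes the assignment-weighted mean of pixels.
        # (Normalizing by the per-cluster mass is an assumption of this sketch.)
        weights = assign / (assign.sum(axis=1, keepdims=True) + 1e-8)
        centers = weights @ features                     # (K, D)
    return centers, assign
```

With two well-separated groups of pixel embeddings and two rough initial centers, the hard labels `assign.argmax(axis=0)` split the pixels by group after a few iterations; how the initial centers are chosen is exactly what the task-specific initialization controls.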

Paper Structure

This paper contains 25 sections, 10 equations, 11 figures, 7 tables, and 2 algorithms.

Figures (11)

  • Figure 1: ClustSeg unifies four segmentation tasks (i.e., superpixel, semantic, instance, and panoptic) from the clustering view, and greatly surpasses existing specialized and unified models.
  • Figure 2: Dreamy-Start for query initialization. (a) To respect the cross-scene semantically consistent nature of semantic/stuff segmentation, the queries/seeds are initialized as class centers (Eq. \ref{eq:stuffquery}). (b) To meet the instance-aware demand of instance/thing segmentation, the initial seeds emerge from the input image (Eq. \ref{eq:thingquery}). (c) To generate a varying number of superpixels, the seeds are initialized from image grids (Eq. \ref{eq:superpixelquery}).
  • Figure 3: (a) Recurrent Cross-Attention instantiates EM clustering for segment-by-clustering. (b) Each Recurrent Cross-Attention layer executes $T$ iterations of cluster assignment (E-step) and center update (M-step). (c) Overall architecture of ClustSeg.
  • Figure 4: ClustSeg reaches the best ASA and CO scores on BSDS500 \cite{arbelaez2011contour} test, among all the deep-learning-based superpixel models (see §\ref{sec:SuS} for details).
  • Figure 5: ClustSeg reaches the best ASA and CO scores on NYUv2 \cite{silberman2012indoor} test (see §\ref{sec:sup} for details).
  • ...and 6 more figures
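
The Figure 2(c) caption says that superpixel seeds are initialized from image grids. As a rough illustration only, the sketch below averages a feature map over a regular grid to produce one seed per cell. The function name `grid_seeds`, the plain averaging, and the `(H, W, D)` layout are assumptions; the exact form of the paper's superpixel initialization equation is not reproduced in this excerpt.

```python
import numpy as np


def grid_seeds(feature_map, grid_h, grid_w):
    """Average a (H, W, D) feature map over a grid_h x grid_w grid.

    Hypothetical helper: returns (grid_h * grid_w, D) seeds in row-major
    cell order. Changing the grid size changes the number of superpixels
    without touching any model weights.
    """
    height, width, dim = feature_map.shape
    if grid_h > height or grid_w > width:
        raise ValueError("grid must not be finer than the feature map")

    # Integer cell boundaries that cover the whole map.
    row_edges = np.linspace(0, height, grid_h + 1).astype(int)
    col_edges = np.linspace(0, width, grid_w + 1).astype(int)

    seeds = []
    for i in range(grid_h):
        for j in range(grid_w):
            cell = feature_map[row_edges[i]:row_edges[i + 1],
                               col_edges[j]:col_edges[j + 1]]
            seeds.append(cell.reshape(-1, dim).mean(axis=0))
    return np.stack(seeds)
```

For example, a 4×4 single-channel map with a 2×2 grid yields four seeds, one per quadrant; these would then serve as the initial cluster centers for the iterative assignment and update steps.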