EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation

Chanyoung Kim, Woojung Han, Dayun Ju, Seong Jae Hwang

TL;DR

EAGLE tackles unsupervised semantic segmentation of complex objects by introducing EiCue, an eigenbasis cue derived from a learnable graph Laplacian that fuses color affinity with semantic similarity. It couples EiCue with an object-centric contrastive learning framework that builds learnable prototypes for each object and enforces intra- and inter-image consistency via a two-direction NCE loss, integrated into a total objective $\mathcal{L}_{\text{total}} = \lambda_{\text{nce}} \mathcal{L}_{\text{nce}}^{x \leftrightarrow \tilde{x}} + (1 - \lambda_{\text{nce}}) \mathcal{L}_{\text{corr}} + \lambda_{\text{eig}} \mathcal{L}_{\text{eig}}$. The approach leverages differentiable eigen clustering to obtain object-level structure and demonstrates state-of-the-art performance on COCO-Stuff, Cityscapes, and Potsdam-3, underscoring its capacity to discern object semantics across diverse scenes. While effective, the method incurs higher training costs due to adjacency/Laplacian construction, highlighting a trade-off between accuracy and computation and suggesting avenues for sampling-based EiCue construction in broader domains.
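
As a small worked illustration of how the three terms in the total objective are weighted, the sketch below combines scalar loss values in plain Python. The function name `total_loss` and the numbers in the usage line are hypothetical placeholders, not values reported by the authors.

```python
def total_loss(l_nce, l_corr, l_eig, lambda_nce, lambda_eig):
    """Weighted sum of the three loss terms named in the TL;DR.

    lambda_nce trades the NCE term off against the correspondence term;
    lambda_eig scales the eigen term independently.
    """
    return lambda_nce * l_nce + (1.0 - lambda_nce) * l_corr + lambda_eig * l_eig


# Hypothetical values, chosen only to make the arithmetic easy to follow:
# 0.25 * 2.0 + 0.75 * 4.0 + 0.5 * 1.0
example = total_loss(l_nce=2.0, l_corr=4.0, l_eig=1.0, lambda_nce=0.25, lambda_eig=0.5)
```

Because the first two weights sum to one, raising `lambda_nce` shifts emphasis from the correspondence term to the contrastive term without changing their combined scale.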

Abstract

Semantic segmentation has innately relied on extensive pixel-level annotated data, leading to the emergence of unsupervised methodologies. Among them, leveraging self-supervised Vision Transformers for unsupervised semantic segmentation (USS) has been making steady progress with expressive deep features. Yet, for semantically segmenting images with complex objects, a predominant challenge remains: the lack of explicit object-level semantic encoding in patch-level features. This technical limitation often leads to inadequate segmentation of complex objects with diverse structures. To address this gap, we present a novel approach, EAGLE, which emphasizes object-centric representation learning for unsupervised semantic segmentation. Specifically, we introduce EiCue, a spectral technique providing semantic and structural cues through an eigenbasis derived from the semantic similarity matrix of deep image features and color affinity from an image. Further, by incorporating our object-centric contrastive loss with EiCue, we guide our model to learn object-level representations with intra- and inter-image object-feature consistency, thereby enhancing semantic accuracy. Extensive experiments on COCO-Stuff, Cityscapes, and Potsdam-3 datasets demonstrate the state-of-the-art USS results of EAGLE with accurate and consistent semantic segmentation across complex scenes.

Paper Structure (20 sections, 11 equations, 15 figures, 7 tables)


Figures (15)

  • Figure 1: We introduce EAGLE, Eigen AGgregation LEarning for object-centric unsupervised semantic segmentation. (a) We first leverage the aggregated eigenvectors, named EiCue, to obtain the semantic structure knowledge of object segments in an image. Based on both semantic and structural cues from the EiCue, we compute object-centric contrastive loss to learn object-level semantic representation. (b) A visual comparison between EAGLE and other methods. Our object-level semantic segmentation results robustly identify objects with complex semantics (e.g., blanket with vivid stripe patterns) by exploiting strong semantic structure cues from EiCue.
  • Figure 2: The pipeline of EAGLE. Leveraging the Laplacian matrix, which integrates hierarchically projected image key features and color affinity, the model exploits eigenvector clustering to capture object-level perspective cues defined as $\mathcal{M}_{\text{eicue}}$ and $\Tilde{\mathcal{M}}_{\text{eicue}}$. Distilling knowledge from $\mathcal{M}_{\text{eicue}}$, our model further adopts an object-centric contrastive loss, utilizing the projected features $\mathbf{Z}$ and $\Tilde{\mathbf{Z}}$. The learnable prototype $\Phi$, assigned from $\mathbf{Z}$ and $\Tilde{\mathbf{Z}}$, acts as a single anchor that contrasts positive objects and negative objects. Our object-centric contrastive loss is computed in two distinct manners: intra($\mathcal{L}_{\text{obj}}$)- and inter($\mathcal{L}_{\text{sc}}$)-image to ensure semantic consistency.
  • Figure 3: An illustration of the EiCue generation process. From the input image, both the color affinity matrix $\mathbf{A}_\text{color}$ and the semantic similarity matrix $\mathbf{A}_\text{seg}$ are derived and then combined to form the Laplacian $\mathbf{L}_\text{sym}$. An eigenvector subset $\hat{\mathbf{V}}$ of $\mathbf{L}_\text{sym}$ is clustered to produce EiCue.
  • Figure 4: Visualizing eigenvectors derived from $\mathbf{S}$ in the Eigen Aggregation Module. These eigenvectors not only distinguish different objects but also identify semantically related areas, highlighting how EiCue captures object semantics and boundaries effectively.
  • Figure 5: A qualitative comparison on the (a) COCO-Stuff and (b) Cityscapes datasets, with models trained using ViT-S/8 and ViT-B/8 as the backbone, respectively. The comparison includes the previous state-of-the-art USS approaches STEGO and HP alongside ours.
  • ...and 10 more figures
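
The EiCue generation process outlined in the Figure 3 caption (affinity matrices, then a symmetric normalized Laplacian, then clustering of an eigenvector subset) can be sketched roughly as follows with NumPy. This is a toy illustration under several assumptions that the captions do not confirm: cosine similarity for the semantic matrix, a Gaussian kernel on mean patch color for the color affinity, simple addition to combine the two, and plain k-means in place of the paper's differentiable eigen clustering. All function and parameter names are hypothetical.

```python
import numpy as np


def symmetric_normalized_laplacian(adjacency):
    """Return L_sym = I - D^{-1/2} A D^{-1/2} for a symmetric, non-negative A."""
    degree = adjacency.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    normalized = adjacency * inv_sqrt[:, None] * inv_sqrt[None, :]
    return np.eye(adjacency.shape[0]) - normalized


def eicue_sketch(features, colors, num_eigvecs=2, num_clusters=2,
                 color_sigma=0.1, num_iters=20):
    """Toy pipeline: affinities -> Laplacian -> eigenvectors -> cluster labels.

    features: (N, D) patch features; colors: (N, 3) mean patch colors in [0, 1].
    """
    # Semantic similarity: cosine similarity clipped at zero (assumed form).
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    a_seg = np.maximum(unit @ unit.T, 0.0)

    # Color affinity: Gaussian kernel on squared color distance (assumed form).
    sq_dist = ((colors[:, None, :] - colors[None, :, :]) ** 2).sum(axis=-1)
    a_color = np.exp(-sq_dist / (2.0 * color_sigma ** 2))

    # Combine the two affinities by addition (assumption) and build L_sym.
    laplacian = symmetric_normalized_laplacian(a_seg + a_color)

    # eigh returns eigenvalues in ascending order; keep the smallest ones.
    _, eigvecs = np.linalg.eigh(laplacian)
    v_hat = eigvecs[:, :num_eigvecs]

    # Plain k-means on the rows of v_hat, with farthest-point initialization.
    centers = [v_hat[0]]
    for _ in range(1, num_clusters):
        dist = np.min([((v_hat - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(v_hat[np.argmax(dist)])
    centers = np.stack(centers)
    for _ in range(num_iters):
        d = ((v_hat[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for k in range(num_clusters):
            if np.any(labels == k):
                centers[k] = v_hat[labels == k].mean(axis=0)

    # Final assignment of each patch to its nearest cluster center.
    d = ((v_hat[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)
```

Calling `eicue_sketch(features, colors)` on an `(N, D)` feature array and an `(N, 3)` color array returns one integer label per patch. The eigenvectors with the smallest eigenvalues are used because, for a graph Laplacian, they vary slowly across strongly connected patches, so patches belonging to the same object tend to land close together in that embedding.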