Learning Invariant Inter-pixel Correlations for Superpixel Generation

Sen Xu, Shikui Wei, Tao Ruan, Lixin Liao

TL;DR

The paper addresses the sensitivity of deep superpixel methods to training data statistics and high-level semantics, which compromises generalization in open-world settings. It proposes Content Disentangle Superpixel (CDS), which uses auxiliary modalities to separate invariant inter-pixel content from style noise through local-grid correlation alignment and global-style mutual information minimization, trained end-to-end with a shared superpixel decoder. The objective combines alignment, mutual information minimization, and superpixel losses. Auxiliary modalities are used only during training, so inference remains efficient. Experiments on four diverse datasets show that CDS achieves superior boundary adherence, generalization, and efficiency, and that these gains carry over to downstream semantic segmentation.

Abstract

Deep superpixel algorithms have made remarkable strides by substituting hand-crafted features with learnable ones. Nevertheless, we observe that existing deep superpixel methods, serving as mid-level representation operations, remain sensitive to the statistical properties (e.g., color distribution, high-level semantics) embedded within the training dataset. Consequently, learnable features exhibit constrained discriminative capability, resulting in unsatisfactory pixel grouping performance, particularly in untrainable application scenarios. To address this issue, we propose the Content Disentangle Superpixel (CDS) algorithm to selectively separate the invariant inter-pixel correlations from the statistical properties, i.e., style noise. Specifically, we first construct auxiliary modalities that are homologous to the original RGB image but have substantial stylistic variations. Then, driven by mutual information, we propose local-grid correlation alignment across modalities to reduce the distribution discrepancy of adaptively selected features and learn invariant inter-pixel correlations. Afterwards, we perform global-style mutual information minimization to enforce the separation of invariant content and training-data styles. The experimental results on four benchmark datasets demonstrate the superiority of our approach over existing state-of-the-art methods in boundary adherence, generalization, and efficiency. Code and the pre-trained model are available at https://github.com/rookiie/CDSpixel.
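
To make the idea of an auxiliary modality concrete, here is a minimal toy sketch in Python. It assumes color inversion as the stylistic change (the Figure 1 caption mentions color inversion, but the construction actually used by CDS is not specified in this summary). The helper names `invert_colors` and `same_as_right_neighbor` are hypothetical and are not taken from the CDSpixel repository. The sketch only illustrates that a style change can alter every pixel value while leaving a simple inter-pixel relation (equality with the right-hand neighbor) untouched.

```python
import numpy as np


def invert_colors(image: np.ndarray) -> np.ndarray:
    """Return a color-inverted copy of a uint8 RGB image of shape (H, W, 3).

    Assumption: inversion stands in for "an auxiliary modality with a
    different style"; it is not claimed to be the paper's construction.
    """
    if image.dtype != np.uint8:
        raise ValueError("expected a uint8 image")
    return 255 - image


def same_as_right_neighbor(image: np.ndarray) -> np.ndarray:
    """Boolean map of shape (H, W-1): True where a pixel equals its
    right-hand neighbor in all channels (a toy inter-pixel relation)."""
    return np.all(image[:, :-1] == image[:, 1:], axis=-1)


rng = np.random.default_rng(0)
# Toy image: every random color is repeated twice along the width,
# so some neighboring pixels are identical and some are not.
image = np.repeat(rng.integers(0, 256, size=(4, 3, 3), dtype=np.uint8), 2, axis=1)
aux = invert_colors(image)

# Pixel values (the "style") differ, but the neighbor relation is unchanged.
relation_rgb = same_as_right_neighbor(image)
relation_aux = same_as_right_neighbor(aux)
```

In this toy setting the two relation maps are identical even though no pixel keeps its original value, which is the kind of content/style separation the abstract describes at the feature level.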

Paper Structure

This paper contains 28 sections, 18 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Motivation. (a) Visualization of the t-SNE distributions on the BSDS dataset. From left to right: the baseline and our CDS. After applying color inversion, the feature distribution of the baseline displays a noticeable decision boundary. In contrast, the feature distribution extracted by CDS is more compact (i.e., [-80, +80] vs. [-40, +40]) and inseparable. (b) Gradually modifying the stylistic information of both auxiliary and original data enhances the purity of the shared invariant inter-pixel correlations.
  • Figure 2: Flowchart of the proposed content disentangle superpixel algorithm.
  • Figure 3: Illustration of the Local-grid Correlation Alignment (LCA) mechanism. LCA performs spatial domain distribution alignment at the superpixel level.
  • Figure 4: Performance comparison on four datasets from different domains. From left to right: BSDS, NYU, KITTI, and VOC datasets. From top to bottom: ASA, BR-BP, and UE metrics. For every metric except UE, higher values indicate a more effective algorithm.
  • Figure 5: Component analysis. From left to right: ASA score on the BSDS and NYU datasets.
  • ...and 4 more figures
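
Figure 4 reports ASA (achievable segmentation accuracy) among its metrics. As a rough illustration, the sketch below computes ASA under its commonly used definition: each superpixel is assigned to the ground-truth segment it overlaps most, and the score is the fraction of pixels that end up correctly labeled. This is a minimal Python sketch, not the paper's evaluation script; the function name `achievable_segmentation_accuracy` and the toy label maps are assumptions made for the example.

```python
import numpy as np


def achievable_segmentation_accuracy(superpixels, ground_truth) -> float:
    """ASA: sum over superpixels of the largest overlap with any
    ground-truth segment, divided by the total number of pixels."""
    sp = np.asarray(superpixels)
    gt = np.asarray(ground_truth)
    if sp.shape != gt.shape:
        raise ValueError("label maps must have the same shape")

    # Relabel both maps to contiguous indices 0..K-1 so they can index a table.
    _, sp_idx = np.unique(sp.ravel(), return_inverse=True)
    _, gt_idx = np.unique(gt.ravel(), return_inverse=True)

    # overlap[i, j] = number of pixels in superpixel i and ground-truth segment j.
    overlap = np.zeros((sp_idx.max() + 1, gt_idx.max() + 1), dtype=np.int64)
    np.add.at(overlap, (sp_idx, gt_idx), 1)

    return float(overlap.max(axis=1).sum() / sp.size)


# Toy example: two ground-truth segments, three superpixels.
gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1]])
sp = np.array([[0, 0, 0, 1],
               [2, 2, 2, 1]])

asa = achievable_segmentation_accuracy(sp, gt)  # 0.75
```

Superpixels 0 and 2 each straddle the ground-truth boundary and lose one pixel, so 6 of the 8 pixels are recoverable. A superpixel map that never crosses a ground-truth boundary scores 1.0, which is why higher ASA indicates better boundary adherence.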