CSL: Class-Agnostic Structure-Constrained Learning for Segmentation Including the Unseen

Hao Zhang, Fang Li, Lu Qi, Ming-Hsuan Yang, Narendra Ahuja

TL;DR

CSL addresses Out-Of-Distribution segmentation and Zero-Shot Semantic Segmentation by introducing a class-agnostic structure-constrained learning framework that can plug into existing segmentation methods. It offers two integration schemes: Scheme 1 distills from a base teacher with structure constraints, and Scheme 2 applies structure constraints during inference without retraining. Key innovations include soft assignment for region-to-pixel mapping, mask split preprocessing to reduce bias from seen classes, and a structure-constrained fusion step that combines per-pixel distributions with region proposals. Empirically, CSL yields consistent improvements across OOD segmentation, ZS3, and domain adaptation benchmarks, often surpassing state-of-the-art baselines.

Abstract

Out-Of-Distribution (OOD) segmentation and Zero-Shot Semantic Segmentation (ZS3) are challenging because both require segmenting unseen classes. Existing strategies adapt the class-agnostic Mask2Former (CA-M2F) to specific tasks. However, these methods each cater to a single task and demand training from scratch, and we demonstrate certain deficiencies in CA-M2F that affect performance. We propose Class-Agnostic Structure-Constrained Learning (CSL), a plug-in framework that integrates with existing methods, embedding structural constraints and improving performance on tasks that include the unseen, specifically OOD segmentation, ZS3, and domain adaptation (DA). CSL integrates with existing methods through two schemes: (1) distilling knowledge from a base teacher network and enforcing constraints across the training and inference phases, or (2) leveraging established models to obtain per-pixel distributions without retraining and appending constraints during the inference phase. We propose soft assignment and mask split methodologies that enhance OOD object segmentation. Empirical evaluations demonstrate that CSL boosts the performance of existing algorithms across OOD segmentation, ZS3, and DA segmentation, consistently surpassing the state of the art on all three tasks.
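
To make the second scheme more concrete, the following is a minimal NumPy sketch of how per-pixel distributions from a frozen model might be combined with class-agnostic region proposals at inference time. The function name `structure_constrained_fusion`, the array shapes, and the pooling rule (mask-weighted averaging within each region, then redistribution to pixels) are assumptions made for illustration; the summary above does not give the exact fusion formula, so this should not be read as the authors' implementation.

```python
import numpy as np

def structure_constrained_fusion(pixel_probs, region_masks, eps=1e-8):
    """Illustrative fusion of per-pixel distributions with region proposals.

    pixel_probs:  (H, W, K) per-pixel class distributions from a frozen model.
    region_masks: (H, W, N) soft region proposals with values in [0, 1].
    Returns an (H, W, K) array of region-smoothed distributions.
    NOTE: the pooling rule below is an assumption, not the paper's equation.
    """
    # Region-level distribution: mask-weighted average of pixel distributions.
    num = np.einsum("hwn,hwk->nk", region_masks, pixel_probs)
    den = region_masks.sum(axis=(0, 1))[:, None]
    region_probs = num / np.maximum(den, eps)  # (N, K)

    # Normalise region membership per pixel so the weights sum to one.
    total = region_masks.sum(axis=-1, keepdims=True)
    membership = region_masks / np.maximum(total, eps)  # (H, W, N)

    # Each pixel receives a membership-weighted mix of region distributions.
    return np.einsum("hwn,nk->hwk", membership, region_probs)
```

With hard, non-overlapping masks, every pixel in a region ends up with that region's mean distribution, which is one way a structural constraint can suppress isolated per-pixel errors.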

Paper Structure (24 sections, 5 equations, 4 figures, 7 tables)

This paper contains 24 sections, 5 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overview of our CSL framework. CSL consists of a backbone, a pixel decoder, a transformer decoder, a base teacher network, and MLPs. $N$ learnable region queries and the image features are fed to the transformer decoder and MLPs to obtain $N$ pairs of latent region prototypes and their validity scores. We calculate the normalized similarity between each element $\mathcal{E}_{h,w}$ of the per-pixel embeddings and each $\mathbf{P}_n, n\in \{1,2,\dots,N\}$, by a simple dot product followed by the sigmoid function, yielding $N$ region scores. The validity scores indicate the degree to which the region prototypes are valid for a given image. During training, a valid loss, a region loss, and a distillation loss are used to optimize the model. Instead of assigning each pixel of the input image to one of a fixed set of predefined classes, CSL assigns it to one of $N$ learnable region prototypes via our proposed soft assignment. During inference, we introduce structure-constrained fusion to calculate the final prediction.
  • Figure 2: Visualisations of hard mask predictions at TPR=95% and per-pixel out-of-distribution (OOD) scores. We compare the results of SML and ObsNet with those of our proposed CSL. In the hard mask predictions, white and gray indicate pixels predicted to be OOD. In the OOD scores, the red and blue intensity values correspond to the magnitudes of the OOD scores above and below the decision boundary, respectively. (d) shows the region proposals from our CSL with scheme$_2$.
  • Figure 3: Visualisations of the efficacy of CA-training and soft assignment. CA+HA represents CA-M2F, where the model is trained in a class-agnostic way and performs inference via hard assignment; None-CA represents a model trained with a class loss that performs inference via soft assignment; and CA+SA represents CSL, where the model is trained in a class-agnostic way and performs inference via soft assignment. Note that the model is trained only on Cityscapes and tested on COCO-Stuff.
  • Figure 4: Embedding visualisations of Figure \ref{['fig6']}-(e) by t-SNE. We plot the region prototypes as red crosses ($\times$), the per-pixel embeddings from the background as green bullets, and those from the skier as blue bullets. The sizes of the prototype markers indicate the validity scores.
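
The region-score computation described in the Figure 1 caption (dot product between per-pixel embeddings and region prototypes, followed by a sigmoid) can be sketched in NumPy as below. The function names, the array shapes, and in particular the validity-weighted normalisation used for `soft_assignment` are assumptions made for illustration; the captions do not spell out the exact soft-assignment rule, so this is only one plausible reading, contrasted with an argmax-based hard assignment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def region_scores(embeddings, prototypes):
    """Dot product followed by sigmoid, as in the Figure 1 caption.

    embeddings: (H, W, C) per-pixel embeddings.
    prototypes: (N, C) latent region prototypes.
    Returns (H, W, N) region scores in (0, 1).
    """
    return sigmoid(embeddings @ prototypes.T)

def hard_assignment(scores):
    """Assign each pixel to its single best-scoring region: (H, W) indices."""
    return scores.argmax(axis=-1)

def soft_assignment(scores, validity, eps=1e-8):
    """ASSUMED rule: weight region scores by validity, normalise over regions.

    validity: (N,) validity scores, one per region prototype.
    Returns (H, W, N) per-pixel weights that sum to one over the N regions.
    """
    weighted = scores * validity  # broadcasts (H, W, N) * (N,)
    total = weighted.sum(axis=-1, keepdims=True)
    return weighted / np.maximum(total, eps)
```

Under this reading, a prototype with a low validity score contributes little to any pixel, whereas hard assignment would still hand it every pixel for which it happens to score highest.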