Optimizing against Infeasible Inclusions from Data for Semantic Segmentation through Morphology

Shamik Basu, Luc Van Gool, Christos Sakaridis

TL;DR

InSeIn addresses infeasible high-level spatial relations in semantic segmentation. It extracts feasible and infeasible class inclusions from the training data and enforces feasibility with a differentiable morphological loss. The method adds a novel inclusion loss to the standard cross-entropy loss, computed via a differentiable area-opening procedure on softmax score maps, and is used as a plug-in for various state-of-the-art networks. Empirically, InSeIn yields consistent mIoU improvements on Cityscapes, ADE20K, and ACDC, while substantially reducing infeasible inclusions as measured by the mINF metric and lowering false response errors. The approach is lightweight: it is applied only during training and adds no learned parameters beyond the loss weight, which makes it promising for robust semantic segmentation under domain shift and complex scene layouts.

Abstract

State-of-the-art semantic segmentation models are typically optimized in a data-driven fashion, minimizing solely per-pixel or per-segment classification objectives on their training data. This purely data-driven paradigm often leads to absurd segmentations, especially when the domain of input images is shifted from the one encountered during training. For instance, state-of-the-art models may assign the label "road" to a segment that is included in another segment labeled "sky". However, the ground truth of the existing dataset at hand dictates that such an inclusion is not feasible. Our method, Infeasible Semantic Inclusions (InSeIn), first extracts explicit inclusion constraints that govern spatial class relations from the semantic segmentation training set at hand in an offline, data-driven fashion, and then enforces a morphological yet differentiable loss that penalizes violations of these constraints during training to promote prediction feasibility. InSeIn is a lightweight plug-and-play method, constitutes a novel step towards minimizing infeasible semantic inclusions in the predictions of learned segmentation models, and yields consistent and significant performance improvements over diverse state-of-the-art networks across the ADE20K, Cityscapes, and ACDC datasets. https://github.com/SHAMIK-97/InSeIn
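
To make the offline extraction step more concrete, here is a minimal pure-Python sketch of how feasible inclusions could be collected from ground-truth label maps. It rests on assumptions that the abstract does not state: a segment is a 4-connected region of one class, and it counts as "included" when it does not touch the image border and all of its neighbouring pixels belong to a single other class. The function names `observed_inclusions` and `infeasible_pairs` are hypothetical, and the authors' actual criterion may differ.

```python
from collections import deque


def observed_inclusions(label_map):
    """Return the set of (outer_class, inner_class) pairs seen in one label map.

    Assumed definition (not taken from the paper): a 4-connected segment is
    included if it does not touch the image border and every neighbouring
    pixel outside the segment carries the same class.
    """
    h, w = len(label_map), len(label_map[0])
    seen = [[False] * w for _ in range(h)]
    pairs = set()
    for sy in range(h):
        for sx in range(w):
            if seen[sy][sx]:
                continue
            cls = label_map[sy][sx]
            seen[sy][sx] = True
            queue = deque([(sy, sx)])
            touches_border = False
            neighbour_classes = set()
            # Breadth-first flood fill over the segment that contains (sy, sx).
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w):
                        touches_border = True
                    elif label_map[ny][nx] != cls:
                        neighbour_classes.add(label_map[ny][nx])
                    elif not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if not touches_border and len(neighbour_classes) == 1:
                pairs.add((next(iter(neighbour_classes)), cls))
    return pairs


def infeasible_pairs(label_maps, num_classes):
    """Class pairs (outer, inner) never observed as an inclusion in the data."""
    feasible = set()
    for label_map in label_maps:
        feasible |= observed_inclusions(label_map)
    all_pairs = {(a, b) for a in range(num_classes)
                 for b in range(num_classes) if a != b}
    return all_pairs - feasible
```

Under these assumptions, any ordered class pair that never appears as an inclusion in the training annotations is treated as infeasible, which mirrors the data-driven spirit of the method without claiming to reproduce its exact rules.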

Paper Structure (17 sections, 6 equations, 10 figures, 10 tables, 1 algorithm)

Figures (10)

  • Figure 1: Left: A traffic light segment includes a building segment in a semantic annotation from the ground truth of the dataset, which is a feasible inclusion. Right: A bus segment includes traffic sign segments (in the red boxes) in a semantic prediction, which is an infeasible inclusion. Our method addresses such physical infeasibilities in semantic segmentation. Best viewed on a screen and zoomed in.
  • Figure 2: Overview of InSeIn. Top left: the complete network architecture, where the standard cross-entropy loss $l_{\text{ce}}$ of the baseline network $\phi$ is added to the inclusion loss $l_{\text{inclusion}}$ computed by InSeIn (cf. the method overview section). Bottom: the pipeline of InSeIn (cf. the InSeIn section). For the softmax outputs $P_{\phi}(I)$ of the network and for each pair of classes $(c_i,c_j) \in \mathcal{C'}$ that signifies an infeasible inclusion, where $c_i$ cannot include $c_j$, we take the difference of the respective softmax scores, $P_{\phi_{c_j}}(I) - P_{\phi_{c_i}}(I)$, and rectify it with a $\mathrm{ReLU}$. After concatenating all $|\mathcal{C'}|$ rectified difference maps channel-wise into $\hat{P}_{\phi}(I)$, we negate the latter and set its border pixels to 1. After that, an iterative area-opening operation is performed $T$ times on $\hat{P}_{\phi}(I)$. In each iteration, we perform max-pooling with a $3 \times 3$ kernel and a stride of 1, and the result is multiplied element-wise with $\hat{P}_{\phi}(I)$. The final area-opened tensor channels differ from their counterparts in $\hat{P}_{\phi}(I)$ only across regions of incorrect inclusions. This element-wise difference is stored in $\mathbf{H}(I)$, and the $L_1$ norm of the latter, capturing both the spatial extent and the intensity of infeasible inclusions, constitutes the inclusion loss $l_{\text{inclusion}}$ employed in InSeIn.
  • Figure 3: Visualization of feature maps within InSeIn for a class pair corresponding to an infeasible inclusion. In this example semantic prediction of SegFormer [xie2021segformer] on Cityscapes [Cordts_2016_CVPR], bus infeasibly includes traffic sign. In the blue frame, we first show the difference map between the softmax scores of the two classes using the coolwarm colormap, where blue tones indicate larger scores for traffic sign and red tones larger scores for bus. We rectify the difference between the two softmax maps and show only the respective channel of the 3D tensor $\hat{P}_{\phi}(I)$, where black indicates zeros and red tones indicate positive values. The area opening operation is performed on $\hat{P}_{\phi}(I)$, and the regions in which $\hat{P}_{\phi}(I)$ is positive (red) but which are not connected to the border, i.e. traffic sign segments infeasibly included in bus segments, are opened. Exactly these segments are isolated in the final tensor $\mathbf{H}(I)$ and are used to compute our inclusion loss $l_{\text{inclusion}}$.
  • Figure 4: Qualitative comparison on Cityscapes. Best viewed on a screen with full zoom. From left to right: input image, ground-truth semantic labels, and predictions of Mask2Former [Cheng2022Mask2Former] and InSeIn.
  • Figure 5: Qualitative comparison on ADE20K. Best viewed on a screen with full zoom. From left to right: input image, ground-truth semantic labels, and predictions of OCRNet [YuanCW19OCRNet] and InSeIn.
  • ...and 5 more figures
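
The pipeline in the Figure 2 caption can be illustrated with a small NumPy sketch of the forward computation. This is one reading of the caption, not the authors' implementation, and it makes several assumptions: the function names (`max_pool_3x3`, `border_reachable`, `inclusion_loss`) are hypothetical, the marker is seeded with ones on the image border, and the caption's negation step is not reproduced because its exact role cannot be recovered from the caption alone. NumPy provides no gradients, so training would need the same operations expressed in an autograd framework.

```python
import numpy as np


def max_pool_3x3(x):
    """3x3 max-pooling with stride 1 on a 2D array (edge padding keeps the shape)."""
    padded = np.pad(x, 1, mode="edge")
    h, w = x.shape
    # Nine shifted views of the padded map; their element-wise max is the pooled map.
    shifted = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.max(shifted, axis=0)


def border_reachable(p_hat, num_iters):
    """Part of p_hat that can be reached from the image border (assumed seeding)."""
    marker = np.zeros_like(p_hat)
    marker[0, :] = marker[-1, :] = 1.0
    marker[:, 0] = marker[:, -1] = 1.0
    for _ in range(num_iters):
        # Grow the marker by one pixel, then gate it with p_hat element-wise.
        # With soft scores the product attenuates along the path, so this
        # sketch is exact only for near-binary maps.
        marker = max_pool_3x3(marker) * p_hat
    return marker


def inclusion_loss(probs, infeasible_pairs, num_iters):
    """L1 norm of the regions of each rectified difference map not reached from the border.

    probs: array of shape (num_classes, H, W) holding softmax scores.
    infeasible_pairs: iterable of (c_i, c_j), meaning c_i must not include c_j.
    """
    loss = 0.0
    for c_i, c_j in infeasible_pairs:
        # Rectified score difference: positive where c_j outscores c_i.
        p_hat = np.maximum(probs[c_j] - probs[c_i], 0.0)
        opened = border_reachable(p_hat, num_iters)
        # H keeps only what the border-seeded propagation could not reach.
        loss += np.abs(p_hat - opened).sum()
    return float(loss)
```

In this sketch, `num_iters` plays the role of $T$: it bounds how far the border seed can propagate, so it should be at least as large as the longest path from the border through a connected region. A region of class $c_j$ that the seed reaches contributes nothing, while an isolated interior region contributes its full rectified score mass.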