On the Influence of Shape, Texture and Color for Learning Semantic Segmentation

Annika Mütze, Natalie Grabowsky, Edgar Heinert, Matthias Rottmann, Hanno Gottschalk

TL;DR

The paper probes how semantic segmentation learning is influenced by shape, texture, and color cues by constructing cue-specific datasets and training dedicated cue experts across Cityscapes, CARLA, and PASCAL Context. It introduces a generic cue-decomposition pipeline (covering S, T, V, HS) and both early cue-based dataset generation and late pixel-wise cue fusion to study where cues contribute during training. Key findings show that no single cue dominates learning; however, combining shape and color often yields strong performance, especially for small objects and boundaries, with similar cue-ordering observed in CNNs and transformers. The study provides a framework for cue-based analysis, offers insights into robustness and safety through interpretable cue interactions, and suggests several future directions, including panoptic segmentation and multispectral sensing.

Abstract

Recent research has investigated the shape and texture biases of pre-trained deep neural networks (DNNs) in image classification. Those works test how much a trained DNN relies on specific image cues such as texture. The present study shifts the focus to understanding cue influence during training: we examine what DNNs can learn from shape, texture, and color cues in the absence of the others, and investigate their individual and combined influence on learning success. We analyze these cue influences at multiple levels by decomposing datasets into cue-specific versions. Addressing semantic segmentation, we learn the given task from these reduced-cue datasets, creating cue experts. Early fusion of cues is performed by constructing appropriate datasets. This is complemented by a late fusion of experts, which allows us to study how cue influence depends on location at the pixel level. Experiments on Cityscapes, PASCAL Context, and a synthetic CARLA dataset show that, while no single cue dominates, the shape + color expert predominantly improves the prediction of small objects and border pixels. The cue performance order is consistent across the tested convolutional and transformer architectures, indicating similar cue extraction capabilities, although pre-trained transformers are said to be more biased towards shape than convolutional neural networks.
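
To make the idea of a pixel-wise late fusion of cue experts concrete, the following is a minimal sketch and not the authors' fusion method, which the abstract does not specify. It assumes that each expert outputs a per-pixel softmax array of shape (classes, height, width) and uses a simple stand-in rule: at each pixel, take the prediction of the most confident expert. The function name `late_fusion_max_confidence` and the expert names are hypothetical.

```python
import numpy as np


def late_fusion_max_confidence(expert_probs):
    """Fuse cue experts pixel by pixel (illustrative rule, an assumption).

    expert_probs: dict mapping an expert name to a softmax array of
                  shape (num_classes, height, width).
    Returns (fused_labels, winner, names):
      fused_labels: (height, width) class ids after fusion
      winner:       (height, width) index into `names` of the expert
                    whose prediction was used at each pixel
      names:        expert names in the order used for `winner`
    """
    names = sorted(expert_probs)
    stacked = np.stack([expert_probs[n] for n in names])  # (E, C, H, W)

    confidence = stacked.max(axis=1)   # (E, H, W) top softmax score
    labels = stacked.argmax(axis=1)    # (E, H, W) predicted class ids

    # Per pixel, select the expert with the highest confidence.
    winner = confidence.argmax(axis=0)  # (H, W)
    fused_labels = np.take_along_axis(labels, winner[None], axis=0)[0]
    return fused_labels, winner, names


# Toy example: two experts, two classes, an image of 1 x 2 pixels.
shape_expert = np.array([[[0.9, 0.4]], [[0.1, 0.6]]])
texture_expert = np.array([[[0.3, 0.2]], [[0.7, 0.8]]])
fused, winner, names = late_fusion_max_confidence(
    {"shape": shape_expert, "texture": texture_expert}
)
```

The `winner` map is what makes a location-dependent analysis possible: counting, for example, how often each expert wins on border pixels versus segment interiors indicates where each cue contributes.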

Paper Structure

This paper contains 44 sections, 20 figures, 10 tables.

Figures (20)

  • Figure 1: A sample of cues and cue combinations extracted from the Cityscapes dataset, based on which cue expert models are trained.
  • Figure 2: Extraction process of the texture (T) cue. It consists of three main steps: class-wise patch extraction, class-wise mosaic image construction, and segmentation dataset creation based on Voronoi diagrams.
  • Figure 3: Class-specific cue influence for CNN-based $\text{S}_{\text{EED-RGB}}$ and $\text{T}_{\text{RGB}}$ on Cityscapes.
  • Figure 4: Comparison of the predictions of the two experts $\text{S}_{\text{EED-RGB}}$ (left) and $\text{T}_{\text{RGB}}$ (middle) for Cityscapes, CARLA and PASCAL Context. The ground truth is displayed as a reference (right).
  • Figure 5: Coverage of experts over the frequent class 'road' (top) and rare class 'person' (bottom) on the CARLA dataset. The recall on the y-axis is defined by the fraction of pixels in a ground-truth segment covered by a prediction of the correct class.
  • ...and 15 more figures
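
The recall used in the caption of Figure 5 (the fraction of pixels in a ground-truth segment covered by a prediction of the correct class) can be sketched as follows. This is a minimal illustration under assumptions: the segment is given as a boolean mask, the prediction as an array of class ids, and the function name `segment_recall` is hypothetical.

```python
import numpy as np


def segment_recall(gt_segment_mask, pred_labels, class_id):
    """Fraction of a ground-truth segment predicted as its true class.

    gt_segment_mask: (H, W) array, truthy where the segment lies
    pred_labels:     (H, W) array of predicted class ids
    class_id:        ground-truth class of the segment
    """
    mask = np.asarray(gt_segment_mask, dtype=bool)
    num_pixels = int(mask.sum())
    if num_pixels == 0:
        raise ValueError("ground-truth segment is empty")

    # Count segment pixels whose predicted class matches the true class.
    covered = int((np.asarray(pred_labels)[mask] == class_id).sum())
    return covered / num_pixels


# Toy example: a segment of 4 pixels with class id 7, 3 of them covered.
mask = np.array([[1, 1, 0], [1, 1, 0]])
pred = np.array([[7, 7, 7], [7, 0, 0]])
recall = segment_recall(mask, pred, class_id=7)
```

Pixels predicted as the class outside the segment (the top-right pixel in the toy example) do not affect this value, since only pixels inside the ground-truth segment are counted.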