On the Influence of Shape, Texture and Color for Learning Semantic Segmentation
Annika Mütze, Natalie Grabowsky, Edgar Heinert, Matthias Rottmann, Hanno Gottschalk
TL;DR
The paper studies how shape, texture, and color cues influence the learning of semantic segmentation by constructing cue-specific datasets and training a dedicated cue expert on each, using Cityscapes, CARLA, and PASCAL Context. It introduces a generic cue-decomposition pipeline (covering S, T, V, HS) and combines cues in two ways: early fusion, by generating datasets that contain several cues, and late fusion, by merging expert predictions pixel-wise, which shows where in the image each cue contributes. No single cue dominates learning; however, combining shape and color often yields strong performance, especially for small objects and object boundaries, and the performance ordering of the cues is similar for CNNs and transformers. The study provides a framework for cue-based analysis, offers insights into robustness and safety through interpretable cue interactions, and suggests several future directions, including panoptic segmentation and multispectral sensing.
Abstract
Recent research has investigated the shape and texture biases of pre-trained deep neural networks (DNNs) in image classification. Those works test how much a trained DNN relies on specific image cues such as texture. The present study shifts the focus to cue influence during training: we analyze what DNNs can learn from shape, texture, and color cues in the absence of the others, and investigate their individual and combined influence on learning success. We analyze these cue influences at multiple levels by decomposing datasets into cue-specific versions. Addressing semantic segmentation, we learn the given task from these reduced cue datasets, creating cue experts. Early fusion of cues is performed by constructing appropriate datasets. This is complemented by a late fusion of experts, which allows us to study how cue influence depends on image location at the pixel level. Experiments on Cityscapes, PASCAL Context, and a synthetic CARLA dataset show that while no single cue dominates, the shape + color expert predominantly improves the prediction of small objects and border pixels. The cue performance order is consistent across the tested convolutional and transformer architectures, indicating similar cue extraction capabilities, although pre-trained transformers are said to be more biased towards shape than convolutional neural networks.
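
To make the idea of decomposing a dataset into cue-specific versions more concrete, the following is a minimal sketch of how a color cue could be split into a brightness-only image and a hue/saturation-only image. It assumes that V denotes the value channel and HS the hue and saturation channels of the HSV color model; the function name `split_color_cues` and the parameter `fixed_value` are hypothetical and chosen for illustration only, and the sketch is not the authors' pipeline.

```python
import colorsys


def split_color_cues(rgb_image, fixed_value=0.5):
    """Split an RGB image into a V-only and an HS-only version.

    rgb_image: nested list (rows of pixels), each pixel an (r, g, b)
               tuple of floats in [0, 1].
    fixed_value: constant brightness used in the HS-only image
                 (an assumption of this sketch, not taken from the paper).
    """
    v_image, hs_image = [], []
    for row in rgb_image:
        v_row, hs_row = [], []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            # V cue: keep only brightness, rendered as a gray pixel.
            v_row.append((v, v, v))
            # HS cue: keep hue and saturation, replace brightness
            # with a constant so it carries no information.
            hs_row.append(colorsys.hsv_to_rgb(h, s, fixed_value))
        v_image.append(v_row)
        hs_image.append(hs_row)
    return v_image, hs_image
```

Applying such a function to every image of a dataset while keeping the segmentation labels unchanged would yield two reduced datasets, each of which could in principle be used to train a separate cue expert.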
