Trapped in texture bias? A large scale comparison of deep instance segmentation

Johannes Theodoridis, Jessica Hofmann, Johannes Maucher, Andreas Schilling

TL;DR

The paper investigates whether instance segmentation models exhibit texture bias and how design choices affect robustness to out-of-distribution texture. It uses Stylized COCO to create controlled, object-centric texture variations and runs a large-scale comparison of 68 pre-trained models on 61 versions of MS COCO (4148 evaluations in total), reporting the relative metric $rP_{\alpha} = P_{\alpha} / P_{COCO}$. The results show that architecture and depth (particularly deeper backbones, deformable necks, and dynamic architectures) drive robustness more than pre-training or standard data augmentation, with YOLACT++, SOTR, and SOLOv2 showing the most texture-robust performance on larger objects. Overall, although many segmentation methods exhibit texture bias, the study identifies concrete design directions to mitigate it and establishes a robust baseline for future work.
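As a minimal sketch of the relative metric $rP_{\alpha} = P_{\alpha} / P_{COCO}$, the snippet below divides a model's score on a stylized version by its score on the original COCO data and ranks models by the result. The model names and AP values are hypothetical placeholders chosen for illustration; they are not results from the paper, and the helper name `relative_performance` is an assumption.

```python
def relative_performance(p_alpha: float, p_coco: float) -> float:
    """Return rP_alpha = P_alpha / P_COCO for one model."""
    if p_coco <= 0:
        raise ValueError("P_COCO must be positive")
    return p_alpha / p_coco


# Hypothetical AP values, invented for illustration only (not from the paper).
scores = {
    "model_a": {"coco": 40.0, "stylized": 10.0},
    "model_b": {"coco": 30.0, "stylized": 12.0},
}

# Relative performance per model: higher means less degradation under stylization.
relative = {
    name: relative_performance(s["stylized"], s["coco"])
    for name, s in scores.items()
}

# Rank models from most to least robust according to the relative metric.
ranking = sorted(relative, key=relative.get, reverse=True)
```

Normalizing by the COCO score makes models with different absolute accuracy comparable: in this toy example `model_b` ranks first even though `model_a` has the higher score on the original data.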

Abstract

Do deep learning models for instance segmentation generalize to novel objects in a systematic way? For classification, such behavior has been questioned. In this study, we aim to understand if certain design decisions such as framework, architecture or pre-training contribute to the semantic understanding of instance segmentation. To answer this question, we consider a special case of robustness and compare pre-trained models on a challenging benchmark for object-centric, out-of-distribution texture. We do not introduce another method in this work. Instead, we take a step back and evaluate a broad range of existing literature. This includes Cascade and Mask R-CNN, Swin Transformer, BMask, YOLACT(++), DETR, BCNet, SOTR and SOLOv2. We find that YOLACT++, SOTR and SOLOv2 are significantly more robust to out-of-distribution texture than other frameworks. In addition, we show that deeper and dynamic architectures improve robustness whereas training schedules, data augmentation and pre-training have only a minor impact. In summary we evaluate 68 models on 61 versions of MS COCO for a total of 4148 evaluations.

Paper Structure (14 sections, 3 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: Left: Simplified creation process of the Stylized COCO dataset. Style images are randomly chosen from Kaggle's Painter by Numbers dataset. Right: We use mask annotations to create counterfactual, object-centric versions of Stylized COCO. We include more examples of the creation process in the supplementary material.
  • Figure 2: Depending on the style image, object boundaries can vanish due to strong stylization. The Stylized Objects and Background versions of Stylized COCO resolve this issue
  • Figure 3: Top row: Comparison of COCO and Stylized COCO at different alphas. The AdaIN method introduces subtle artifacts even at $\alpha=0$ (no style). Bottom left: We control the style strength in feature space (yellow to pink) and pixel space (blue to pink). Each alpha corresponds to a complete, correspondingly stylized version of the val2017 subset. Bottom right: Comparison of image gradients and color histograms at different alphas
  • Figure 4: Left: Average structural similarity between image gradients in relation to COCO (a score of 1 means that there is no difference between images). Right: Wasserstein distance between RGB histograms (reversed y-axis)
  • Figure 5: Absolute performances on COCO val2017. Training schedules in epochs have been appended to model names. Note that the Yolo score is bounding box AP, which is not comparable with the other scores but is included for completeness. Methods that did not report scores for val2017 have been validated on test-dev2017 first
  • ...and 5 more figures
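
The Figure 4 caption refers to a Wasserstein distance between RGB histograms. The sketch below shows one way such a distance could be computed with NumPy. The bin count, the per-channel averaging, and the function names are assumptions made for illustration, since this excerpt does not specify how the paper computes it.

```python
import numpy as np


def channel_histogram(channel: np.ndarray, bins: int = 256) -> np.ndarray:
    """Normalized intensity histogram of one 8-bit color channel."""
    hist, _ = np.histogram(channel, bins=bins, range=(0, 256))
    return hist / hist.sum()


def wasserstein_1d(p: np.ndarray, q: np.ndarray) -> float:
    """Wasserstein-1 distance between two histograms on the same unit-width bins.

    For 1D distributions this equals the L1 distance between the two CDFs.
    """
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())


def rgb_histogram_distance(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Average the per-channel distance over R, G and B (an assumed aggregation)."""
    distances = [
        wasserstein_1d(
            channel_histogram(img_a[..., c]),
            channel_histogram(img_b[..., c]),
        )
        for c in range(3)
    ]
    return float(np.mean(distances))


# Synthetic 4x4 RGB images: one all black, one uniformly brighter by 10 levels.
dark = np.zeros((4, 4, 3), dtype=np.uint8)
brighter = np.full((4, 4, 3), 10, dtype=np.uint8)

same = rgb_histogram_distance(dark, dark)
shifted = rgb_histogram_distance(dark, brighter)
```

With unit-width bins, shifting every pixel by 10 intensity levels moves all histogram mass by 10 bins, so the distance for the shifted pair is 10, while identical images give 0.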