Trapped in texture bias? A large scale comparison of deep instance segmentation
Johannes Theodoridis, Jessica Hofmann, Johannes Maucher, Andreas Schilling
TL;DR
The paper investigates whether instance segmentation models exhibit texture bias and how design choices affect robustness to out-of-distribution texture. It uses Stylized COCO to create controlled, object-centric texture variations and runs a large-scale, cross-model comparison of 68 pre-trained models on 61 versions of MS COCO (4148 evaluations in total), reporting the relative metric $rP_{\alpha} = P_{\alpha} / P_{COCO}$. The results show that architecture and depth drive robustness more than pre-training or standard augmentation: deeper backbones, deformable necks, and dynamic architectures help most, with YOLACT++, SOTR, and SOLOv2 leading in texture-robust performance on larger objects. Overall, while texture bias exists in many segmentation methods, the study identifies concrete design directions to mitigate it and establishes a robust baseline for future work.
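The relative metric can be illustrated with a minimal sketch. It assumes that $P_{\alpha}$ is a model's score on a stylized dataset version and $P_{COCO}$ is the same model's score on the original MS COCO data; the function name, model names, and all numbers below are hypothetical and are not results from the paper.

```python
def relative_performance(p_alpha: float, p_coco: float) -> float:
    """Return rP_alpha = P_alpha / P_COCO for a single model."""
    if p_coco <= 0:
        raise ValueError("P_COCO must be positive")
    return p_alpha / p_coco


# Hypothetical (P_COCO, P_alpha) pairs; NOT values reported in the paper.
scores = {
    "model_a": (0.40, 0.10),
    "model_b": (0.30, 0.12),
}

# Normalize each model's stylized score by its own score on the original data.
relative = {
    name: relative_performance(p_alpha, p_coco)
    for name, (p_coco, p_alpha) in scores.items()
}

# Rank models from most to least robust under the relative metric.
ranking = sorted(relative, key=relative.get, reverse=True)
```

In this made-up example, model_b scores lower than model_a on the original data but retains a larger fraction of its performance under stylization, so it ranks first. Normalizing by $P_{COCO}$ is what makes models with different baseline accuracy comparable in terms of robustness.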
Abstract
Do deep learning models for instance segmentation generalize to novel objects in a systematic way? For classification, such behavior has been questioned. In this study, we aim to understand whether certain design decisions, such as framework, architecture or pre-training, contribute to the semantic understanding of instance segmentation. To answer this question, we consider a special case of robustness and compare pre-trained models on a challenging benchmark for object-centric, out-of-distribution texture. We do not introduce another method in this work. Instead, we take a step back and evaluate a broad range of existing literature. This includes Cascade and Mask R-CNN, Swin Transformer, BMask, YOLACT(++), DETR, BCNet, SOTR and SOLOv2. We find that YOLACT++, SOTR and SOLOv2 are significantly more robust to out-of-distribution texture than other frameworks. In addition, we show that deeper and dynamic architectures improve robustness, whereas training schedules, data augmentation and pre-training have only a minor impact. In summary, we evaluate 68 models on 61 versions of MS COCO for a total of 4148 evaluations.
