Resolving Inconsistent Semantics in Multi-Dataset Image Segmentation
Qilong Zhangli, Di Liu, Abhishek Aich, Dimitris Metaxas, Samuel Schulter
TL;DR
RESI addresses the inconsistent semantics that arise when segmentation models are trained on multiple datasets. It unifies label spaces through CLIP-based language embeddings of class names and adds label-space-specific query embeddings (LSQEs) to Mask2Former. This decouples dataset-specific information from shared semantics and allows inference over any combination of the training label spaces. To better handle overlaps in mixed label spaces, RESI integrates ESF-OMI post-processing, and it introduces a unified evaluation metric, PIQ, that blends panoptic and instance segmentation aspects. Across four mixed-label benchmarks, RESI consistently outperforms baselines on semantic mIoU, panoptic PQ, instance AP, and PIQ.
Abstract
Leveraging multiple training datasets to scale up image segmentation models is beneficial for increasing robustness and semantic understanding. Individual datasets have well-defined ground truth with non-overlapping mask layouts and mutually exclusive semantics. However, merging them for multi-dataset training disrupts this harmony and leads to semantic inconsistencies; for example, the class "person" in one dataset and the class "face" in another require multi-label handling for certain pixels. Existing methods struggle with this setting, particularly when evaluated on label spaces mixed from the individual training sets. To overcome these issues, we introduce a simple yet effective multi-dataset training approach that integrates language-based embeddings of class names and label-space-specific query embeddings. Our method maintains high performance regardless of the underlying inconsistencies between training datasets. Notably, on four benchmark datasets with label space inconsistencies during inference, we outperform previous methods by 1.6% mIoU for semantic segmentation, 9.1% PQ for panoptic segmentation, 12.1% AP for instance segmentation, and 3.0% in the newly proposed PIQ metric.
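The "person" versus "face" inconsistency can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: the dataset names, class lists, and three-dimensional vectors are hypothetical stand-ins (a real system would use CLIP text embeddings of the class names), and the label-space-specific query embeddings are not modeled. The sketch only shows how scoring one mask query against class-name embeddings yields different labels depending on which label space is active at inference.

```python
import math

# Hypothetical label spaces for two training datasets (illustrative names only).
LABEL_SPACES = {
    "dataset_a": ["person", "car", "road"],
    "dataset_b": ["face", "car", "sky"],
}

# Toy stand-ins for language embeddings of class names (assumption: a real
# pipeline would obtain these from a CLIP text encoder).
TOY_TEXT_EMBEDDINGS = {
    "person": [1.0, 0.2, 0.0],
    "face": [0.9, 0.5, 0.0],
    "car": [0.0, 1.0, 0.1],
    "road": [0.0, 0.1, 1.0],
    "sky": [0.1, 0.0, 0.9],
}


def unify(label_spaces):
    """Merge class names into one ordered vocabulary, deduplicating by name."""
    unified = []
    for names in label_spaces.values():
        for name in names:
            if name not in unified:
                unified.append(name)
    return unified


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify(query, text_embeddings, active_classes):
    """Return the active class whose text embedding best matches the query."""
    return max(active_classes, key=lambda name: cosine(query, text_embeddings[name]))


# A hypothetical mask query embedding for a region covering a face.
query = [1.0, 0.45, 0.0]

label_a = classify(query, TOY_TEXT_EMBEDDINGS, LABEL_SPACES["dataset_a"])
label_mixed = classify(query, TOY_TEXT_EMBEDDINGS, unify(LABEL_SPACES))
print(label_a, label_mixed)
```

Under dataset_a alone the region can only be labeled "person", while in the mixed label space "face" competes for the same pixels. This is the kind of conflict that a single mutually exclusive label per pixel cannot express.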
