Resolving Inconsistent Semantics in Multi-Dataset Image Segmentation

Qilong Zhangli; Di Liu; Abhishek Aich; Dimitris Metaxas; Samuel Schulter

TL;DR

RESI addresses the core challenge of inconsistent semantics when training segmentation models on multiple datasets by unifying label spaces through CLIP-based language embeddings and introducing label-space-specific query embeddings (LSQEs) within Mask2Former. The approach decouples dataset-specific information from shared semantics, enabling flexible inference over any combination of training-label spaces. To further improve handling of overlaps in mixed spaces, RESI integrates ESF-OMI post-processing and provides a unified evaluation metric, PIQ, that blends panoptic and instance segmentation aspects. Across four mixed-label benchmarks, RESI consistently outperforms baselines on semantic mIoU, panoptic PQ, instance AP, and PIQ, demonstrating robust cross-dataset generalization and practical applicability for scalable segmentation.

Abstract

Leveraging multiple training datasets to scale up image segmentation models is beneficial for increasing robustness and semantic understanding. Individual datasets have well-defined ground truth with non-overlapping mask layouts and mutually exclusive semantics. However, merging them for multi-dataset training disrupts this harmony and leads to semantic inconsistencies; for example, the class "person" in one dataset and class "face" in another will require multilabel handling for certain pixels. Existing methods struggle with this setting, particularly when evaluated on label spaces mixed from the individual training sets. To overcome these issues, we introduce a simple yet effective multi-dataset training approach by integrating language-based embeddings of class names and label space-specific query embeddings. Our method maintains high performance regardless of the underlying inconsistencies between training datasets. Notably, on four benchmark datasets with label space inconsistencies during inference, we outperform previous methods by 1.6% mIoU for semantic segmentation, 9.1% PQ for panoptic segmentation, 12.1% AP for instance segmentation, and 3.0% in the newly proposed PIQ metric.
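The idea of using language-based embeddings of class names as the classifier can be illustrated with a small sketch. This is not the authors' implementation: the 3-d vectors stand in for text-encoder outputs, and the `classify` helper, the `temperature` value, and the toy label spaces are assumptions made for illustration. It shows only that a single query embedding can be scored against any set of class names, so the same classifier serves different label spaces.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(query_embedding, class_text_embeddings, temperature=0.1):
    """Softmax over cosine similarities between one query and each class name.

    `temperature` is a hypothetical scaling factor, not a value from the paper.
    """
    names = list(class_text_embeddings)
    logits = [
        cosine(query_embedding, class_text_embeddings[name]) / temperature
        for name in names
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Toy vectors standing in for text embeddings of class names (made-up values).
TEXT_EMB = {
    "person": [1.0, 0.1, 0.0],
    "face": [0.6, 0.8, 0.0],
    "car": [0.0, 0.1, 1.0],
}

# Two label spaces with inconsistent semantics: one has "person", one has "face".
label_space_a = {k: TEXT_EMB[k] for k in ("person", "car")}
label_space_b = {k: TEXT_EMB[k] for k in ("face", "car")}

query = [0.7, 0.7, 0.0]  # hypothetical decoder output for one mask query
probs_a = classify(query, label_space_a)
probs_b = classify(query, label_space_b)
best_a = max(probs_a, key=probs_a.get)
best_b = max(probs_b, key=probs_b.get)
```

In this toy setup the same query is labeled "person" under the first label space and "face" under the second, which is the kind of conflict that the label space-specific query embeddings are meant to disentangle.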

Paper Structure (35 sections, 2 equations, 13 figures, 5 tables)

Introduction
Related Work
Image Segmentation:
Scaling Data for Segmentation Models:
Method
Motivation
Preliminaries: Model Framework
Language Embeddings as Classifiers
Label-space Specific Query Embeddings
Inconsistent Semantics.
Naive approach.
Our solution.
Underlying intuition.
Inference with LSQEs
Experiments
...and 20 more sections

Figures (13)

Figure 1: Leveraging multiple datasets for training segmentation models increases robustness and semantic understanding. However, existing methods (1) fail to capture the full masks of objects, such as the "person" category, and (2) often predict incorrect labels, for instance mistaking "legs" for "pants". This issue is caused by unexpected conflicts between multiple label spaces, even though each dataset (A and B) has consistent ground truth mask layouts that provide non-overlapping and mutually exclusive semantics.
Figure 2: Overview of our proposed framework. Training. We build upon Mask2Former [cheng2021mask2former] and replace the fixed-label space classifier with language-based embeddings. We introduce learnable label space-specific query embeddings (LSQEs) that are added to the decoder in order to handle conflicting label spaces that arise in the multi-dataset setting. Inference. Given a new label space for inference (any combination of the categories of the training datasets), the model first predicts which training label spaces can "serve" the test label space by matching the text embeddings of the class names. This process selects the LSQEs that are needed for inference. Then, the decoder of the model runs once for each selected LSQE, i.e., at most $K$ times, where $K$ is the number of training datasets.
Figure 3: Examples of the mixed-label space evaluation-only datasets. Each row shows two examples from CIHP/CSP$_{\textrm{P}}$ (left and middle) and one from CIHP/CSP$_{\textrm{M}}$ (right).
Figure 4: Visual comparison of multi-category segmentation performance. We present an overview of RESI's capabilities in handling complex label spaces. The depicted scenarios demonstrate the model's proficiency in simultaneous cross-label space multi-category predictions, a task where traditional segmentation approaches often fall short.
Figure 5: Average PQ w.r.t. total training cost for three panoptic segmentation datasets (COCO, ADE20K, Mapillary Vistas) and two mixed-label space evaluation-only datasets, CIHP/CSP$_{\textrm{multi}}$.
...and 8 more figures
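The inference step described in the Figure 2 caption (matching text embeddings of class names to decide which training label spaces can serve a test label space) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the function `select_label_spaces`, the `threshold` of 0.9, and the toy one-hot embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_label_spaces(test_classes, training_spaces, embed, threshold=0.9):
    """Return indices of training label spaces that can serve the test classes.

    A training label space is selected if at least one of its class names has
    a text embedding close enough to some test class name. The threshold is
    an assumed value chosen for this toy example.
    """
    selected = []
    for idx, space in enumerate(training_spaces):
        if any(
            cosine(embed[t], embed[c]) >= threshold
            for t in test_classes
            for c in space
        ):
            selected.append(idx)
    return selected


# Toy one-hot vectors standing in for text embeddings (made-up values).
EMBED = {
    "person": [1.0, 0.0, 0.0],
    "face": [0.0, 1.0, 0.0],
    "road": [0.0, 0.0, 1.0],
}

# K = 2 training datasets, each with its own label space (and its own LSQE).
TRAINING_SPACES = [["person", "road"], ["face"]]

only_first = select_label_spaces(["road"], TRAINING_SPACES, EMBED)
both = select_label_spaces(["person", "face"], TRAINING_SPACES, EMBED)
```

Each selected index corresponds to one LSQE and therefore one decoder pass, so the number of passes never exceeds the number of training datasets.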
