Mixed-Query Transformer: A Unified Image Segmentation Architecture

Pei Wang; Zhaowei Cai; Hao Yang; Ashwin Swaminathan; R. Manmatha; Stefano Soatto

TL;DR

MQ-Former addresses universal image segmentation by training a single model across multiple tasks and datasets. Its mixed query mechanism combines learnable and conditional queries and uses Hungarian matching to pair queries with ground-truth targets. The model supports open-vocabulary segmentation through a text encoder, is trained with a joint loss $L = L_c + L_b + L_m$, and augments its training data with synthetic masks and captions to improve generalization. Empirically, MQ-Former delivers competitive results across semantic, instance, panoptic, referring, and foreground segmentation, and it generalizes better to open-set tasks, with a gain of over 7 points on the open-vocabulary SeginW benchmark. The work demonstrates that fully unified segmentation models are practical and highlights synthetic data as a scalable, cost-effective way to expand training data without task-specific architecture changes.

Abstract

Existing unified image segmentation models either employ a unified architecture across multiple tasks but use separate weights tailored to each dataset, or apply a single set of weights to multiple datasets but are limited to a single task. In this paper, we introduce the Mixed-Query Transformer (MQ-Former), a unified architecture for multi-task and multi-dataset image segmentation using a single set of weights. To enable this, we propose a mixed query strategy, which can effectively and dynamically accommodate different types of objects without heuristic designs. In addition, the unified architecture allows us to use data augmentation with synthetic masks and captions to further improve model generalization. Experiments demonstrate that MQ-Former not only handles multiple segmentation datasets and tasks with performance competitive with specialized state-of-the-art models, but also generalizes better to open-set segmentation tasks, as evidenced by over 7 points higher performance than the prior art on the open-vocabulary SeginW benchmark.
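
As a rough illustration of how a mixed pool of queries can accommodate different object types without a hand-written rule, the sketch below matches ground-truth objects to queries by minimum total cost. It is a toy under stated assumptions, not the authors' implementation: the cost values, the query labels, and the helper name `min_cost_matching` are all hypothetical, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimum on small inputs; practical code would typically call something like `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations


def min_cost_matching(cost):
    """Return (total_cost, assignment), where assignment[g] is the index of
    the query matched to ground-truth object g.

    Brute-force search over one-to-one assignments; a stand-in for the
    Hungarian algorithm, which reaches the same optimum far more efficiently.
    """
    num_gt = len(cost)
    num_queries = len(cost[0])
    best_total, best_assign = float("inf"), None
    for perm in permutations(range(num_queries), num_gt):
        total = sum(cost[g][q] for g, q in enumerate(perm))
        if total < best_total:
            best_total, best_assign = total, perm
    return best_total, list(best_assign)


# Hypothetical mixed pool: queries 0-1 are learnable, queries 2-3 are conditional.
query_types = ["learnable", "learnable", "conditional", "conditional"]

# Made-up matching costs: rows are ground-truth objects, columns are queries.
cost = [
    [0.9, 0.8, 0.2, 0.7],  # a stuff region, e.g. "sky"
    [0.1, 0.6, 0.5, 0.9],  # a thing instance, e.g. "person"
]

total, assignment = min_cost_matching(cost)
matched_types = [query_types[q] for q in assignment]
print(assignment, matched_types)
```

In this toy, the matcher alone decides which kind of query serves each object. Nothing in the code ties stuff or thing classes to a particular query type, which is the property the mixed query strategy is described as having.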

Paper Structure (21 sections, 1 equation, 9 figures, 10 tables)

Introduction
Related Work
Method
MQ-Former Architecture
Object Query Strategies
Unified Segmentation Training
Enhancement with Synthetic Data
Experiments
Comparison among Different Query Strategies
Enhancement by Synthetic Data
Enhancement by Scaling up Datasets and Tasks
Comparison with the state-of-the-art
Conclusion
Full Results of SeginW
Qualitative Results
...and 6 more sections

Figures (9)

Figure 1: Tasks supported by MQ-Former. Within a single training configuration, MQ-Former supports joint training on multiple segmentation tasks and datasets, and open-vocabulary inference across multiple tasks.
Figure 2: Overview of the MQ-Former architecture. The model takes an image and a list of text prompts as input and outputs the corresponding localized segment masks.
Figure 3: Comparison of different query strategies. Square with diagonal slashes: learnable query; solid square: conditional query; circle with slashes: query embedding of a learnable query; solid circle: query embedding of a conditional query; triangle with slashes: ground truth of stuff classes; solid triangle: ground truth of thing classes. (a) Learnable queries are learned from scratch. (b) Conditional queries are derived and selected from the encoder. (c) Separated queries consist of both learnable and conditional queries, which are associated with stuff and thing classes respectively. (d) Mixed queries also consist of both types of queries but do not impose a thing/stuff distinction.
Figure 4: Synthetic data visualization. Upper: synthetic masks by SAM; Lower: synthetic captions by OFA-akin model.
Figure 5: Benefits of dynamic query selection in the mixed query strategy. Upper: stuff objects are predicted with conditional queries instead of learnable queries; Lower: thing objects are predicted with learnable queries instead of conditional queries.
...and 4 more figures
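
The Figure 3 caption describes conditional queries as "derived and selected from encoder" and mixed queries as a combination of both query types with no thing/stuff distinction. A minimal sketch of that construction follows. Everything specific in it is an assumption for illustration: the top-k selection by a per-token score, the helper names `top_k_indices` and `build_mixed_queries`, and the toy vectors are hypothetical, and the caption does not say how the paper scores or selects encoder tokens.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def build_mixed_queries(learnable, encoder_tokens, token_scores, k):
    """Concatenate learnable queries with k conditional queries taken from
    the encoder tokens with the highest scores (assumed selection rule).

    Each query is tagged with its origin only so the result can be inspected;
    no stuff/thing label is attached to either kind.
    """
    selected = top_k_indices(token_scores, k)
    conditional = [encoder_tokens[i] for i in selected]
    return ([("learnable", q) for q in learnable]
            + [("conditional", q) for q in conditional])


# Toy inputs: two learnable query vectors and four encoder token vectors.
learnable = [[0.0, 0.1], [0.2, 0.3]]
encoder_tokens = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
token_scores = [0.3, 0.9, 0.1, 0.7]  # made-up per-token scores

mixed = build_mixed_queries(learnable, encoder_tokens, token_scores, k=2)
for origin, vector in mixed:
    print(origin, vector)
```

The separated strategy in panel (c) would differ only in what happens afterwards: it would restrict each query type to one class group, whereas the mixed strategy leaves the whole pool open to every ground-truth object.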
