Mixed-Query Transformer: A Unified Image Segmentation Architecture
Pei Wang, Zhaowei Cai, Hao Yang, Ashwin Swaminathan, R. Manmatha, Stefano Soatto
TL;DR
MQ-Former tackles universal image segmentation by training a single model, with a single set of weights, across multiple tasks and datasets. Its mixed query mechanism blends learnable and conditional queries and uses Hungarian matching to align queries with their training targets. The approach supports open vocabulary via a text encoder and augments training with synthetic masks and captions to improve generalization. The model is trained with a joint loss $L = L_c + L_b + L_m$. Empirically, MQ-Former delivers competitive results across semantic, instance, panoptic, referring, and foreground segmentation, and generalizes notably better to open-set tasks, with a gain of over 7 points on the open-vocabulary SeginW benchmark. This work demonstrates the practical viability of fully unified segmentation models and highlights synthetic data as a scalable, cost-effective means to expand training data without task-specific architecture changes.
Abstract
Existing unified image segmentation models either employ a unified architecture across multiple tasks but use separate weights tailored to each dataset, or apply a single set of weights to multiple datasets but are limited to a single task. In this paper, we introduce the Mixed-Query Transformer (MQ-Former), a unified architecture for multi-task and multi-dataset image segmentation using a single set of weights. To enable this, we propose a mixed query strategy, which dynamically accommodates different types of objects without heuristic designs. In addition, the unified architecture allows us to use data augmentation with synthetic masks and captions to further improve model generalization. Experiments demonstrate that MQ-Former not only handles multiple segmentation datasets and tasks with performance competitive with specialized state-of-the-art models, but also generalizes better to open-set segmentation tasks, as evidenced by a gain of over 7 points over the prior art on the open-vocabulary SeginW benchmark.
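The TL;DR mentions Hungarian matching. As a rough illustration of what such a matching step computes, the sketch below finds the one-to-one assignment of queries to targets with the lowest total cost. It is not the authors' implementation: the function name `match_queries` and the cost values are hypothetical, the way MQ-Former builds its cost matrix is not described here, and the search is a brute-force enumeration that returns the same optimum as the Hungarian algorithm but is only practical for very small inputs.

```python
from itertools import permutations


def match_queries(cost):
    """Return (pairs, total) for the minimum-cost one-to-one assignment.

    cost[q][t] is the (hypothetical) cost of assigning query q to target t.
    Every target receives exactly one query; surplus queries stay unmatched.
    Brute force stands in for the Hungarian algorithm and gives the same optimum.
    """
    num_queries = len(cost)
    num_targets = len(cost[0]) if cost else 0
    if num_targets > num_queries:
        raise ValueError("need at least as many queries as targets")

    best_pairs, best_total = [], float("inf")
    # Each permutation picks one distinct query for every target.
    for queries in permutations(range(num_queries), num_targets):
        total = sum(cost[q][t] for t, q in enumerate(queries))
        if total < best_total:
            best_total = total
            best_pairs = [(q, t) for t, q in enumerate(queries)]
    return sorted(best_pairs), best_total


# Made-up costs for 3 queries (rows) and 2 targets (columns).
example_cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
pairs, total = match_queries(example_cost)
print(pairs)  # [(0, 1), (1, 0)]: query 0 -> target 1, query 1 -> target 0
```

In this toy case query 2 is left unmatched, which is the usual situation when a model carries more queries than there are objects in an image.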
