Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
Jiazhen Liu, Mingkuan Feng, Long Chen
TL;DR
STAMP introduces all-mask prediction to decouple autoregressive dialogue from non-autoregressive mask generation in MLLMs, solving the segmentation trilemma of preserving dialogue, achieving strong segmentation, and enabling fast inference. The two-phase pipeline first generates a textual response and signals segmentation, then predicts all mask tokens in a single forward pass via patch-wise classification with hybrid attention, yielding dense masks efficiently. End-to-end training combines text-generation loss with patch-level mask loss, enabling joint optimization that preserves language abilities while delivering high-quality segmentation. Across RES, ReasonSeg, and VQA tasks, STAMP achieves state-of-the-art results with inference speeds comparable to embedding-based methods, validating the practicality of all-mask prediction for unified visual-language models.
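To make the "hybrid attention" idea concrete, the following is a minimal sketch of one plausible attention pattern, not the authors' implementation. It assumes the sequence is laid out as a causal prefix (image, prompt, and generated text tokens) followed by one mask token per image patch, and that mask tokens attend to the whole prefix and to each other bidirectionally. The function name `hybrid_attention_mask` and this exact layout are illustrative assumptions.

```python
def hybrid_attention_mask(num_prefix: int, num_mask: int) -> list[list[bool]]:
    """Build a boolean attention pattern (assumed layout, for illustration only).

    allowed[i][j] is True when query position i may attend to key position j.
    Positions [0, num_prefix) are ordinary autoregressive tokens.
    Positions [num_prefix, num_prefix + num_mask) are mask tokens, one per patch.
    """
    total = num_prefix + num_mask
    allowed = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_prefix:
                # Prefix tokens stay causal: they see themselves and earlier tokens only.
                allowed[i][j] = j <= i
            else:
                # Mask tokens see the full prefix and every other mask token,
                # so all of them can be predicted in the same forward pass.
                allowed[i][j] = True
    return allowed


if __name__ == "__main__":
    # 3 prefix tokens followed by 2 mask tokens; '#' marks an allowed position.
    for row in hybrid_attention_mask(3, 2):
        print("".join("#" if ok else "." for ok in row))
```

Because prefix rows never attend to mask-token columns in this sketch, the text tokens are processed exactly as in ordinary causal decoding, which is one way to keep dialogue generation untouched by the mask-prediction step.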
Abstract
Integrating segmentation into Multimodal Large Language Models (MLLMs) presents a core trilemma: simultaneously preserving dialogue ability, achieving high segmentation performance, and ensuring fast inference. Prevailing paradigms are forced into a compromise. Embedding prediction methods introduce a conflicting pixel-level objective that degrades the MLLM's general dialogue abilities. The alternative, next-token prediction, reframes segmentation as an autoregressive task, which preserves dialogue but forces a trade-off between poor segmentation performance with sparse outputs and prohibitively slow inference with rich ones. We resolve this trilemma with all-mask prediction, a novel paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. We present STAMP: Simultaneous Textual All-Mask Prediction, an MLLM that embodies this paradigm. After generating a textual response, STAMP predicts an entire segmentation mask in a single forward pass by treating it as a parallel "fill-in-the-blank" task over image patches. This design maintains the MLLM's dialogue ability by avoiding conflicting objectives, enables high segmentation performance by leveraging rich, bidirectional spatial context for all mask tokens, and achieves exceptional speed. Extensive experiments show that STAMP significantly outperforms state-of-the-art methods across multiple segmentation benchmarks, providing a solution that excels in dialogue, segmentation, and speed without compromise.
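The parallel "fill-in-the-blank" step can be illustrated with a small sketch. It assumes a hypothetical model has already produced, in one forward pass, a pair of (background, foreground) logits for every image patch in row-major order. The two-class vocabulary, the function names, and the pass-count comparison are simplifying assumptions for illustration rather than details taken from the paper.

```python
def decode_mask(patch_logits, grid_h, grid_w):
    """Turn per-patch logits into a binary patch grid.

    patch_logits: list of (background_logit, foreground_logit) pairs,
                  one per patch, in row-major order (assumed layout).
    Returns a grid_h x grid_w list of 0/1 labels.
    """
    if len(patch_logits) != grid_h * grid_w:
        raise ValueError("expected exactly one logit pair per patch")
    # Each patch is classified independently, so all labels come from the
    # same set of logits with no sequential dependency between patches.
    labels = [1 if fg > bg else 0 for bg, fg in patch_logits]
    return [labels[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


def forward_passes(num_mask_tokens, parallel):
    """Idealized count of model calls needed to emit the mask tokens.

    Simplification: next-token decoding is counted as one pass per mask
    token, all-mask prediction as a single pass.
    """
    return 1 if parallel else num_mask_tokens


if __name__ == "__main__":
    logits = [(2.0, -1.0), (0.1, 0.9), (-0.5, 0.5), (1.0, 0.0)]
    print(decode_mask(logits, 2, 2))
    print(forward_passes(256, parallel=False), forward_passes(256, parallel=True))
```

Under these assumptions, the cost of mask generation no longer grows with the number of mask tokens, which is why a dense mask need not imply slow inference.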
