Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction

Jiazhen Liu, Mingkuan Feng, Long Chen

TL;DR

STAMP introduces all-mask prediction to decouple autoregressive dialogue from non-autoregressive mask generation in MLLMs, solving the segmentation trilemma of preserving dialogue, achieving strong segmentation, and enabling fast inference. The two-phase pipeline first generates a textual response and signals segmentation, then predicts all mask tokens in a single forward pass via patch-wise classification with hybrid attention, yielding dense masks efficiently. End-to-end training combines text-generation loss with patch-level mask loss, enabling joint optimization that preserves language abilities while delivering high-quality segmentation. Across RES, ReasonSeg, and VQA tasks, STAMP achieves state-of-the-art results with inference speeds comparable to embedding-based methods, validating the practicality of all-mask prediction for unified visual-language models.
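The joint objective described above can be illustrated with a minimal sketch. The summary only states that a text-generation loss is combined with a patch-level mask loss; the binary cross-entropy form of the mask term, the `mask_weight` factor, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math

def text_loss(correct_token_probs):
    """Mean negative log-likelihood of the correct next tokens (text generation)."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)

def mask_loss(patch_probs, patch_labels, eps=1e-12):
    """Mean binary cross-entropy over image patches.

    ASSUMPTION: the patch-level mask loss is modeled here as a per-patch
    foreground/background classification; the source does not give its form.
    """
    total = 0.0
    for p, y in zip(patch_probs, patch_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(patch_probs)

def joint_loss(correct_token_probs, patch_probs, patch_labels, mask_weight=1.0):
    """Weighted sum of both terms; mask_weight is a hypothetical hyperparameter."""
    return text_loss(correct_token_probs) + mask_weight * mask_loss(patch_probs, patch_labels)

if __name__ == "__main__":
    # Two text tokens predicted with some confidence, four patches with labels.
    print(joint_loss([0.9, 0.8], [0.7, 0.2, 0.9, 0.1], [1, 0, 1, 0]))
```

Because both terms are token-level classification losses over the same model outputs, they can be summed and backpropagated in a single pass, which is one plausible reading of how end-to-end training avoids a separate pixel-level objective.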

Abstract

Integrating segmentation into Multimodal Large Language Models (MLLMs) presents a core trilemma: simultaneously preserving dialogue ability, achieving high segmentation performance, and ensuring fast inference. Prevailing paradigms are forced into a compromise. Embedding prediction methods introduce a conflicting pixel-level objective that degrades the MLLM's general dialogue abilities. The alternative, next-token prediction, reframes segmentation as an autoregressive task, which preserves dialogue but forces a choice between poor segmentation performance with sparse outputs and prohibitively slow inference with rich ones. We resolve this trilemma with all-mask prediction, a novel paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. We present STAMP: Simultaneous Textual All-Mask Prediction, an MLLM that embodies this paradigm. After generating a textual response, STAMP predicts an entire segmentation mask in a single forward pass by treating it as a parallel "fill-in-the-blank" task over image patches. This design maintains the MLLM's dialogue ability by avoiding conflicting objectives, enables high segmentation performance by leveraging rich, bidirectional spatial context for all mask tokens, and achieves fast inference. Extensive experiments show that STAMP significantly outperforms state-of-the-art methods across multiple segmentation benchmarks, providing a solution that excels in dialogue, segmentation, and speed without compromise.
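To make the "bidirectional spatial context for all mask tokens" concrete, here is a small sketch of one way a hybrid attention pattern could be built. The exact layout is not specified in this section, so the following is an assumption: prefix tokens (image, prompt, generated response, and the segmentation trigger) use ordinary causal attention, while the mask placeholders appended afterwards attend to the whole prefix and to each other. The function name and arguments are hypothetical.

```python
def hybrid_attention_mask(num_prefix, num_mask):
    """Return allowed[i][j], True when position i may attend to position j.

    ASSUMED layout (not confirmed by the source):
      positions [0, num_prefix)        -> causal attention (autoregressive text)
      positions [num_prefix, total)    -> mask placeholders, which see the full
                                          prefix and all other mask placeholders
    """
    total = num_prefix + num_mask
    allowed = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_prefix:
                # Prefix token: may only look at itself and earlier tokens.
                allowed[i][j] = j <= i
            else:
                # Mask placeholder: bidirectional over prefix and mask block.
                allowed[i][j] = True
    return allowed

if __name__ == "__main__":
    for row in hybrid_attention_mask(num_prefix=3, num_mask=2):
        print("".join("x" if ok else "." for ok in row))
```

Under this layout the prefix rows are identical to those of a plain causal mask, so text generation is unaffected, and every mask placeholder can be resolved in the same forward pass because none of them waits on another's output.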

Paper Structure

This paper contains 27 sections, 2 equations, 11 figures, 9 tables, 1 algorithm.

Figures (11)

  • Figure 1: The trilemma of segmentation in MLLMs. Embedding prediction may harm dialogue abilities. Next-token prediction methods are either fast with poor segmentation performance or achieve superior performance at the cost of inference speed, particularly when generating rich content (e.g., chain-of-thought or patch-wise classification).
  • Figure 2: Comparison of MLLM-based segmentation paradigms. (a) Embedding Prediction: A conflicting pixel-level objective [lai2024lisa, ren2024pixellm] degrades the MLLM's general dialogue capabilities. (b) Next-token Prediction: Generates masks autoregressively [wang2023visionllm, liu2025seg, lan2024text4seg], forcing a trade-off between poor segmentation performance (for sparse outputs) and slow inference (for rich outputs). (c) Our All-mask Prediction: We decouple dialogue generation (autoregressive) from mask generation (non-autoregressive). By simultaneously predicting all mask tokens as patch-wise classifications in a single pass, our paradigm resolves the segmentation trilemma, uniting preserved dialogue abilities, high segmentation performance, and fast inference speed.
  • Figure 3: The STAMP Pipeline. Phase 1 (Dialogue Generation): The MLLM autoregressively generates a conversational response, emitting a special <SEG> token to trigger mask generation. Phase 2 (All-mask Generation): Triggered by the <SEG> token, a sequence of [MASK] placeholders, each fused with its corresponding image patch feature, is processed. A single non-autoregressive forward pass with hybrid attention then predicts all mask tokens simultaneously.
  • Figure 4: Showcase of STAMP against competing methods.
  • Figure 5: Efficiency comparison. Methods from the same paradigm are grouped by color. Numbers following the model names denote the input resolution.
  • ...and 6 more figures