Text4Seg: Reimagining Image Segmentation as Text Generation

Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang

TL;DR

Text4Seg redefines image segmentation as text generation by introducing semantic descriptors: a sequence of patch-level textual labels that represents an image. This text-as-mask approach integrates with existing MLLMs without an additional decoder, which simplifies training and scaling. A key efficiency mechanism is Row-wise Run-Length Encoding (R-RLE), which shortens descriptors by about 74% and speeds up inference roughly 3× without sacrificing performance, while a SAM-based mask refiner can further improve pixel-level accuracy. Across referring expression segmentation (RES), referring expression comprehension (REC), visual question answering (VQA), and open-vocabulary segmentation, Text4Seg achieves competitive or state-of-the-art results with multiple MLLM backbones.

Abstract

Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge. In this paper, we introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. This unified representation allows seamless integration into the auto-regressive training pipeline of MLLMs for easier optimization. We demonstrate that representing an image with $16\times16$ semantic descriptors yields competitive segmentation performance. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Extensive experiments across various vision tasks, such as referring expression segmentation and comprehension, show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones. Our approach provides an efficient, scalable solution for vision-centric tasks within the MLLM framework.
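To make the two ideas in the abstract concrete, the sketch below builds semantic descriptors for a toy 4×4 patch grid and compresses them one row at a time with run-length encoding. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `label*count` run syntax, and the `|` and newline separators are hypothetical choices made here, and the grid is made-up data rather than a real $16\times16$ descriptor.

```python
from itertools import groupby


def semantic_descriptors(label_grid):
    """Flatten a grid of patch labels into one row-major label sequence."""
    return [label for row in label_grid for label in row]


def rrle_encode(label_grid, run_sep="|", row_sep="\n"):
    """Run-length encode each row independently (row-wise RLE).

    The "label*count" syntax and the separators are assumptions made
    for this sketch, not the serialization used in the paper.
    """
    rows = []
    for row in label_grid:
        # groupby merges consecutive identical labels within one row only,
        # so runs never continue across a row boundary.
        runs = [f"{label}*{len(list(group))}" for label, group in groupby(row)]
        rows.append(run_sep.join(runs))
    return row_sep.join(rows)


def rrle_decode(text, run_sep="|", row_sep="\n"):
    """Invert rrle_encode back into a grid of patch labels."""
    grid = []
    for line in text.split(row_sep):
        row = []
        for run in line.split(run_sep):
            label, count = run.rsplit("*", 1)
            row.extend([label] * int(count))
        grid.append(row)
    return grid


# Toy example: each entry is the text label of one image patch.
grid = [
    ["sky", "sky", "sky", "sky"],
    ["sky", "dog", "dog", "sky"],
    ["grass", "dog", "dog", "grass"],
    ["grass", "grass", "grass", "grass"],
]

encoded = rrle_encode(grid)
print(encoded)
print(len(semantic_descriptors(grid)), "labels before compression")
```

Because each row is encoded on its own, the row breaks survive in the text and the decoder can rebuild the 2D grid without being told the image width. How much a sequence shrinks depends on how uniform the rows are, so this toy grid says nothing about the 74% figure reported above.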

Paper Structure

This paper contains 51 sections, 21 figures, 14 tables.

Figures (21)

  • Figure 1: Different paradigms of MLLM-based image segmentation: (a) the embeddings-as-mask paradigm, which relies on an additional segmentation decoder and loss (e.g., LISA [Lai et al., 2024]); (b) polygon coordinates for instance segmentation (e.g., VisionLLM [Wang et al., 2024]); (c) our text-as-mask paradigm, which relies on semantically consistent text sequences.
  • Figure 2: MLLM architecture.
  • Figure 3: An illustration of semantic descriptors for images and two token compression techniques.
  • Figure 4: Visual instruction data.
  • Figure 5: Text4Seg.
  • ...and 16 more figures