X-SAM: From Segment Anything to Any Segmentation

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, Xiaodan Liang

TL;DR

X-SAM proposes a unified segmentation MLLM that extends SAM from 'segment anything' to 'any segmentation' by accepting both text and visual prompts and by introducing Visual GrounDed (VGD) segmentation. The model combines dual encoders, dual projectors, a language model, a segmentation connector, and a redesigned SAM-based decoder that produces masks from the LLM's <SEG> token. It is trained in three stages (segmentor fine-tuning, alignment pre-training, mixed fine-tuning) with dataset balance resampling. Across more than twenty segmentation datasets and seven tasks, a single X-SAM model achieves state-of-the-art results, including competitive generic and open-vocabulary (OV) segmentation, superior VGD and Grounded Conversation Generation (GCG) segmentation, and strong interactive and referring segmentation. This approach advances practical pixel-level understanding in multimodal systems and lays groundwork for future work in video and cross-image grounding.
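
The summary above names dataset balance resampling but does not say how it is computed. As a rough illustration only, the sketch below assumes a repeat-factor rule: a dataset whose share of the mixed training set falls below a threshold is repeated by the square root of (threshold / share), and larger datasets are left unchanged. The rule itself, the threshold value, the function name `repeat_factors`, and the dataset names and sizes are all assumptions made for this example, not details taken from this page.

```python
import math


def repeat_factors(sizes, threshold):
    """Return a repeat factor per dataset.

    Assumed rule (not stated in the source):
        r = max(1, sqrt(threshold / share))
    where share is the dataset's fraction of all samples.
    """
    total = sum(sizes.values())
    factors = {}
    for name, count in sizes.items():
        share = count / total
        factors[name] = max(1.0, math.sqrt(threshold / share))
    return factors


# Hypothetical dataset sizes, chosen only to make the arithmetic easy to follow.
sizes = {"large_set": 9000, "small_set": 1000}
factors = repeat_factors(sizes, threshold=0.4)

# Effective number of samples drawn from each dataset per epoch after resampling.
resampled = {name: round(sizes[name] * factors[name]) for name in sizes}
print(factors)
print(resampled)
```

Under this assumed rule the small dataset is oversampled while the large one keeps its original size, which is the general effect a balance-resampling scheme aims for when co-training on datasets of very different scale.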

Abstract

Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from "segment anything" to "any segmentation". Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.

Paper Structure

This paper contains 21 sections, 3 equations, 13 figures, 22 tables.

Figures (13)

  • Figure 1: Illustration of Performance of X-SAM on Image Segmentation Benchmarks. X-SAM consistently surpasses existing Multimodal Large Language Models (MLLMs) across all evaluated segmentation benchmarks.
  • Figure 2: Illustration of the capabilities of X-SAM. (a) Text query tasks: Generic (Gen.), Referring (Ref.), Reasoning (Rea.), and Grounded Conversation Generation (GCG) segmentation, etc. (b) Vision query tasks: Interactive (Inter.) and Visual GrounDed (VGD) segmentation for single and cross-image inputs.
  • Figure 3: The Overview of X-SAM. X-SAM comprises dual encoders, dual projectors, a language model, a segmentation connector, and a segmentation decoder. The dual encoders process the image and project features to match text embedding dimensions, which are then input to the language model with tokenized text for instruction-guided understanding. The SAM features are connected to the segmentation decoder, which uses the LLM's <SEG> token to generate segmentation masks.
  • Figure 4: The Architecture of Segmentation Connector.
  • Figure 5: The Multi-stage Training of X-SAM. X-SAM performs a multi-stage training process, including segmentor fine-tuning, alignment pre-training, and mixed fine-tuning. Segmentor fine-tuning: train the segmentor on the segmentation datasets to obtain a generalized segmentor. Alignment pre-training: train the dual projectors to align the vision features and the LLM features. Mixed fine-tuning: fine-tune the dual projectors, the segmentation decoder, and the LLM on the mixed datasets.
  • ...and 8 more figures
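
The three-stage schedule described in the Figure 5 caption can be sketched as a small lookup of which modules are updated in each stage. This is a minimal sketch, not the authors' training code: the identifiers (`STAGES`, `MODULES`, `trainable_flags`) are hypothetical, treating every module not named for a stage as frozen is an assumption, and the caption does not say which sub-modules make up the "segmentor", so it is kept as a single opaque entry here.

```python
# Stage name -> modules the Figure 5 caption says are trained in that stage.
STAGES = [
    ("segmentor_fine_tuning", {"segmentor"}),
    ("alignment_pre_training", {"dual_projectors"}),
    ("mixed_fine_tuning", {"dual_projectors", "segmentation_decoder", "language_model"}),
]

# Illustrative module names; the real model may group its parameters differently.
MODULES = ["segmentor", "dual_projectors", "language_model", "segmentation_decoder"]


def trainable_flags(stage_name, modules):
    """Map each module name to True if the stage updates it, else False.

    Assumption: modules not listed for a stage are frozen.
    """
    schedule = dict(STAGES)
    if stage_name not in schedule:
        raise KeyError(f"unknown stage: {stage_name}")
    trainable = schedule[stage_name]
    return {module: module in trainable for module in modules}


for name, _ in STAGES:
    flags = trainable_flags(name, MODULES)
    print(name, [module for module, on in flags.items() if on])
```

In a real training loop, flags like these would typically drive which parameter groups receive gradients in each stage; that wiring depends on the framework and is left out here.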