X-SAM: From Segment Anything to Any Segmentation
Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, Xiaodan Liang
TL;DR
X-SAM is a unified segmentation MLLM that extends SAM from 'segment anything' to 'any segmentation': it accepts both text and visual prompts and introduces a new task, Visual GrounDed (VGD) segmentation. The model combines dual encoders, dual projectors, a language model, a segmentation connector, and a redesigned SAM-based decoder, and it outputs masks via a <SEG> token. Training proceeds in three stages (segmentor fine-tuning, alignment pre-training, mixed fine-tuning) with dataset balance resampling. Evaluated on more than twenty segmentation datasets spanning seven tasks, a single X-SAM model achieves state-of-the-art results, including competitive generic and open-vocabulary (OV) segmentation, superior VGD and grounded conversation generation (GCG) segmentation, and strong interactive and referring segmentation. The approach advances practical pixel-level understanding in multimodal systems and lays groundwork for future work on video and cross-image grounding.
Abstract
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from 'segment anything' to 'any segmentation'. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.
