SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation
Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, Chu-Song Chen
TL;DR
SAM4MLLM addresses pixel-level Referring Expression Segmentation (RES) by coupling a Multi-Modal LLM (MLLM) with the Segment Anything Model (SAM), enabling fine-grained localization without changing the MLLM architecture or adding new loss terms. It encodes segmentation masks as discrete prompts and obtains those prompts for SAM through two strategies, Prompt-Point Generation (PPG) and Proactive Query of Prompt-Points (PQPP), so that SAM handles precise mask generation while the MLLM retains its language-driven reasoning. Training uses RES, GRES, and VQA data with LoRA-based fine-tuning. The method achieves strong results on RES and GRES with far less data than prior methods, demonstrates zero-shot capability on ReasonSeg, and maintains VQA performance. By exploiting the complementary strengths of SAM and MLLMs, it offers a simple, generalizable path to pixel-accurate segmentation, with potential for extension to broader visual reasoning tasks.
Abstract
We introduce SAM4MLLM, an innovative approach that integrates the Segment Anything Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our method enables MLLMs to learn pixel-level location information without requiring excessive modifications to the existing model architecture or adding specialized tokens. We introduce an inquiry-based approach in which the MLLM effectively finds prompt points for SAM to perform segmentation. It combines detailed visual information with the powerful expressive capabilities of large language models in a unified language-based manner, without additional computational overhead in learning. Experimental results on public benchmarks demonstrate the effectiveness of our approach.
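To make the inquiry-based idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a bounding box for the referred object is already available, samples a regular grid of candidate points inside it, and asks the MLLM a yes/no question about each point. The names `grid_points`, `query_prompt_points`, and the `ask` callable (a stand-in for an MLLM call), as well as the question wording and the grid size, are all hypothetical. The labels 1 and 0 follow the foreground/background convention that SAM uses for point prompts.

```python
from typing import Callable, List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def grid_points(box: Box, n: int) -> List[Point]:
    """Return an n x n grid of points at the cell centers of the box."""
    x0, y0, x1, y1 = box
    xs = [round(x0 + (x1 - x0) * (i + 0.5) / n) for i in range(n)]
    ys = [round(y0 + (y1 - y0) * (j + 0.5) / n) for j in range(n)]
    return [(x, y) for y in ys for x in xs]


def query_prompt_points(
    box: Box,
    expression: str,
    ask: Callable[[str], str],  # hypothetical stand-in for an MLLM call
    n: int = 3,
) -> List[Tuple[Point, int]]:
    """Ask the model about each grid point; label 1 = on object, 0 = off."""
    labeled = []
    for x, y in grid_points(box, n):
        # Hypothetical question template; the real prompt format may differ.
        question = f"Is the point ({x}, {y}) on {expression}? Answer yes or no."
        answer = ask(question).strip().lower()
        labeled.append(((x, y), 1 if answer.startswith("yes") else 0))
    return labeled


if __name__ == "__main__":
    # Toy stand-in for the MLLM: only the box center is "on" the object.
    def toy_ask(question: str) -> str:
        return "Yes" if "(30, 30)" in question else "No"

    for point, label in query_prompt_points((0, 0, 60, 60), "the red cup", toy_ask):
        print(point, label)
```

The labeled points, together with the box, are the kind of prompt that a promptable segmenter such as SAM can consume; everything stays in plain text on the MLLM side, which is the property the abstract emphasizes.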
