SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

Yi-Chia Chen; Wei-Hua Li; Cheng Sun; Yu-Chiang Frank Wang; Chu-Song Chen

TL;DR

SAM4MLLM addresses pixel-level Referring Expression Segmentation (RES) by coupling a Multi-Modal LLM (MLLM) with the Segment Anything Model (SAM), enabling fine-grained localization without changing the MLLM architecture or adding new loss terms. It encodes segmentation masks as discrete prompts and prompts SAM through two strategies, Prompt-Point Generation (PPG) and Proactive Query of Prompt-Points (PQPP), thus leveraging SAM's precise mask generation while preserving the LLM's language-driven reasoning. Training uses RES, GRES, and VQA data with LoRA-based fine-tuning, achieving strong results on RES and GRES with far less data than prior methods and demonstrating zero-shot capability on ReasonSeg. The approach maintains VQA performance and offers a simple, generalizable path to pixel-accurate segmentation by exploiting the complementary strengths of SAM and MLLMs. It could be extended to broader visual reasoning tasks in the future.
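To make the idea of "encoding a segmentation mask as a discrete prompt" concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that a mask is summarized by its bounding box plus a few sampled points labelled as inside or outside the object, and that this summary is serialized as plain text an LLM could read or emit. The function names `mask_to_prompt` and `prompt_to_text`, the number of points, and the text format are all hypothetical.

```python
import numpy as np


def mask_to_prompt(mask, num_points=4, seed=0):
    """Summarize a binary mask as a bounding box plus labelled points.

    Hypothetical encoding for illustration; the paper's exact scheme
    is not specified in this summary.
    """
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("mask is empty")
    # Tight bounding box around the foreground pixels (inclusive corners).
    x0, x1 = int(xs.min()), int(xs.max())
    y0, y1 = int(ys.min()), int(ys.max())
    # Sample candidate points uniformly inside the bounding box.
    rng = np.random.default_rng(seed)
    px = rng.integers(x0, x1 + 1, size=num_points)
    py = rng.integers(y0, y1 + 1, size=num_points)
    # Label each point by whether it falls on the object.
    points = [(int(x), int(y), bool(mask[y, x])) for x, y in zip(px, py)]
    return (x0, y0, x1, y1), points


def prompt_to_text(bbox, points):
    """Serialize the box and points as plain text (assumed format)."""
    box_txt = "bbox: [{}, {}, {}, {}]".format(*bbox)
    pts_txt = "; ".join(
        f"({x}, {y}): {'in' if inside else 'out'}" for x, y, inside in points
    )
    return f"{box_txt} | points: {pts_txt}"


# Toy example: a rectangular object on an 8x8 image.
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 3:7] = True
bbox, points = mask_to_prompt(mask)
print(prompt_to_text(bbox, points))
```

In a pipeline of this shape, the box and the points labelled "in" would be handed to SAM as box and point prompts, so the language model only ever handles short text rather than dense pixel masks.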

Abstract

We introduce SAM4MLLM, an innovative approach that integrates the Segment Anything Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our method enables MLLMs to learn pixel-level location information without requiring excessive modifications to the existing model architecture or adding specialized tokens. We introduce an inquiry-based approach that uses the MLLM to find effective prompt points for SAM to perform segmentation. It combines detailed visual information with the powerful expressive capabilities of large language models in a unified language-based manner, without additional computational overhead in learning. Experimental results on public benchmarks demonstrate the effectiveness of our approach.

Paper Structure (42 sections, 6 figures, 10 tables)

This paper contains 42 sections, 6 figures, 10 tables.

Introduction
Related Works
Referring Expression Segmentation.
Image Segmentation and Segment Anything.
Multimodal Large Language Models (MLLMs).
MLLMs Toward Segmentation.
Method
Encode Segmentation Mask into SAM Prompt
Prompting SAM Using MLLM
SAM4MLLM-PPG.
SAM4MLLM-PQPP.
RES Training
Referring Expression Segmentation Datasets (RES dataset):
Generalized Referring Expression Segmentation (GRES) Dataset liu2023gres:
Visual Question Answering (VQA):
...and 27 more sections

Figures (6)

Figure 1: Architecture diagram of SAM4MLLM-PPG. (a) The training process of PPG, (b) The inference process of PPG, (c) SAM as filter.
Figure 1: (a) off-the-shelf SAM (b) finetuned SAM (c) GT
Figure 2: Architecture diagram of SAM4MLLM-PQPP. (a) The training process of PQPP, (b) The inference process of PQPP.
Figure 2: Visual comparisons between (a) Ground Truth, (b) SAM4MLLM (ours), (c) GLaMM rasheed2023glamm and (d) LISA lai2023lisa on RefCOCOg.
Figure 3: Qualitative examples of SAM4MLLM on RES (top) and GRES (bottom) tasks. See the main results section for a detailed description.
...and 1 more figures
