LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning
Junchi Wang, Lei Ke
TL;DR
This work tackles reasoning segmentation with LLM-Seg, a two-stage framework that decouples reasoning from segmentation by pairing a vision-language model with a frozen segmentation foundation model. SAM generates mask proposals, and LLaVA processes a special <SEG> token that is used to rank and select among them; a lightweight fusion module and two heads (IoU and IoP) guide the selection, and the model is trained with LoRA to keep the number of trainable parameters small. A scalable data-generation pipeline uses LVIS and EgoObjects together with GPT-4 in a prompt-engineered loop to create the LLM-Seg40K dataset, which supports training and evaluation of reasoning-based segmentation. Empirically, LLM-Seg achieves competitive or superior performance relative to state-of-the-art methods on ReasonSeg and sets a strong benchmark on LLM-Seg40K, while the data pipeline generates high-quality reasoning segmentation pairs automatically and efficiently. Because the segmentation foundation model is not fine-tuned, the approach brings the reasoning capabilities of LLMs to open-world segmentation at low training cost, with potential impact on robotics, AR, and assistive perception tasks, and provides a new benchmark for future research.
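The two selection scores named above can be illustrated with a minimal sketch. It assumes that IoU is the usual intersection over union between a mask proposal and the target, and that IoP is the intersection divided by the proposal's own area. The function names, the threshold, and the "keep every proposal above the threshold and merge them" rule are hypothetical choices for illustration, not the released implementation; in the actual model the scores would be predicted by the two heads rather than computed from a ground-truth mask.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # binary mask stored as rows of 0/1 values


def area(mask: Mask) -> int:
    """Number of foreground pixels in a binary mask."""
    return sum(1 for row in mask for v in row if v)


def intersection(a: Mask, b: Mask) -> int:
    """Number of pixels that are foreground in both masks."""
    return sum(1 for ra, rb in zip(a, b) for va, vb in zip(ra, rb) if va and vb)


def iou(proposal: Mask, target: Mask) -> float:
    """Intersection over union between a proposal and the target."""
    inter = intersection(proposal, target)
    union = area(proposal) + area(target) - inter
    return inter / union if union else 0.0


def iop(proposal: Mask, target: Mask) -> float:
    """Intersection over the proposal's own area (assumed meaning of IoP)."""
    a = area(proposal)
    return intersection(proposal, target) / a if a else 0.0


def select_and_merge(proposals: List[Mask], scores: List[float],
                     threshold: float = 0.5) -> List[List[int]]:
    """Hypothetical rule: keep proposals scoring >= threshold, return their union."""
    if not proposals:
        return []
    height, width = len(proposals[0]), len(proposals[0][0])
    merged = [[0] * width for _ in range(height)]
    for mask, score in zip(proposals, scores):
        if score >= threshold:
            for r in range(height):
                for c in range(width):
                    if mask[r][c]:
                        merged[r][c] = 1
    return merged


# Toy example: three proposals scored against one target mask.
target = [[1, 1, 0, 0], [1, 1, 0, 0]]
p1 = [[1, 0, 0, 0], [1, 0, 0, 0]]  # lies fully inside the target
p2 = [[0, 1, 1, 0], [0, 1, 1, 0]]  # half inside, half outside
p3 = [[0, 0, 0, 1], [0, 0, 0, 1]]  # no overlap
proposals = [p1, p2, p3]
iop_scores = [iop(p, target) for p in proposals]
merged = select_and_merge(proposals, iop_scores, threshold=0.9)
```

In this toy case a proposal that covers only part of the target has a modest IoU but a high IoP, which is why an IoP-style score is useful when several partial proposals must be combined into one mask.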
Abstract
Understanding human instructions to identify target objects is vital for perception systems. In recent years, advances in Large Language Models (LLMs) have introduced new possibilities for image segmentation. In this work, we delve into reasoning segmentation, a novel task that enables a segmentation system to reason about and interpret implicit user intention via large language model reasoning and then segment the corresponding target. Our work on reasoning segmentation contributes to both methodological design and dataset labeling. For the model, we propose a new framework named LLM-Seg. LLM-Seg effectively connects the current foundational Segment Anything Model (SAM) and the LLM through mask proposal selection. For the dataset, we propose an automatic data generation pipeline and construct a new reasoning segmentation dataset named LLM-Seg40K. Experiments demonstrate that LLM-Seg exhibits competitive performance compared with existing methods. Furthermore, our proposed pipeline can efficiently produce high-quality reasoning segmentation datasets. The LLM-Seg40K dataset, developed through this pipeline, serves as a new benchmark for training and evaluating various reasoning segmentation approaches. Our code, models, and dataset are available at https://github.com/wangjunchi/LLMSeg.
