LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

Junchi Wang; Lei Ke

TL;DR

This work tackles reasoning segmentation by decoupling reasoning from segmentation in a two-stage framework, LLM-Seg, which pairs a vision-language model with a frozen segmentation foundation model. Mask proposals from SAM are ranked and selected using a special <SEG> token produced by LLaVA; a lightweight fusion module and two heads (IoU and IoP) guide the selection, and the model is trained with LoRA to keep the number of trainable parameters small. A scalable data-generation pipeline applies GPT-4, driven by an engineered prompt, to LVIS and EgoObjects to create the LLM-Seg40K dataset, which supports both training and evaluation of reasoning-based segmentation. Empirically, LLM-Seg is competitive with or superior to state-of-the-art methods on ReasonSeg and sets a strong baseline on LLM-Seg40K, while the data pipeline demonstrates efficient automatic generation of high-quality reasoning segmentation pairs. The approach advances open-world segmentation by exploiting the reasoning capabilities of LLMs without fine-tuning the segmentation foundation model, with practical relevance for robotics, AR, and assistive perception, and it provides a new benchmark for future research.
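The two selection heads are named after overlap measures between a mask proposal and the target mask. The sketch below illustrates those measures only; it assumes IoP means intersection over the proposal's own area, which this summary does not define, and the function names are hypothetical rather than taken from the LLM-Seg code base. Masks are plain nested lists of 0/1 values.

```python
def mask_area(mask):
    """Count the foreground pixels of a binary mask (nested lists of 0/1)."""
    return sum(1 for row in mask for v in row if v)


def intersection_area(a, b):
    """Count pixels that are foreground in both masks (same shape assumed)."""
    return sum(
        1
        for row_a, row_b in zip(a, b)
        for va, vb in zip(row_a, row_b)
        if va and vb
    )


def iou_and_iop(proposal, target):
    """Return (IoU, IoP) for one proposal against the target mask.

    IoU = |P & T| / |P | T|
    IoP = |P & T| / |P|   (assumed definition: intersection over proposal)
    """
    inter = intersection_area(proposal, target)
    prop_area = mask_area(proposal)
    union = prop_area + mask_area(target) - inter
    iou = inter / union if union else 0.0
    iop = inter / prop_area if prop_area else 0.0
    return iou, iop


# A proposal that covers only part of the target, but lies fully inside it.
target = [[1, 1, 0],
          [1, 1, 0],
          [0, 0, 0]]
proposal = [[1, 0, 0],
            [1, 0, 0],
            [0, 0, 0]]

iou, iop = iou_and_iop(proposal, target)
print(iou, iop)  # 0.5 1.0
```

Under this assumed definition, a proposal that covers only a part of the target still receives a high IoP even though its IoU is moderate, which is one plausible reason to score proposals with both measures.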

Abstract

Understanding human instructions to identify target objects is vital for perception systems. In recent years, advances in Large Language Models (LLMs) have introduced new possibilities for image segmentation. In this work, we delve into reasoning segmentation, a novel task in which a segmentation system reasons about and interprets implicit user intention via large language model reasoning and then segments the corresponding target. Our work on reasoning segmentation contributes to both methodological design and dataset labeling. For the model, we propose a new framework named LLM-Seg. LLM-Seg effectively connects the Segment Anything Model, a current segmentation foundation model, with the LLM through mask proposal selection. For the dataset, we propose an automatic data generation pipeline and construct a new reasoning segmentation dataset named LLM-Seg40K. Experiments demonstrate that LLM-Seg exhibits competitive performance compared with existing methods. Furthermore, our proposed pipeline can efficiently produce high-quality reasoning segmentation datasets. The LLM-Seg40K dataset, developed through this pipeline, serves as a new benchmark for training and evaluating various reasoning segmentation approaches. Our code, models and dataset are at https://github.com/wangjunchi/LLMSeg.

Paper Structure (23 sections, 4 equations, 6 figures, 5 tables)


Introduction
Related Work
Method
Model Structure
Mask Proposals
Mask Selection
Model Training
LLM-Seg40K
Dataset Definition
Data Generation Pipeline
Data Sources
Prompt Engineering of GPT-4
Dataset Analysis
Experiments
Evaluation of LLM-Seg
...and 8 more sections

Figures (6)

Figure 1: Our LLM-Seg model integrates multiple foundation models, including LLaVA [liu2023visual], the Segment Anything Model [kirillov2023SAM], and DINOv2 [oquab2023dinov2]. The Segment Anything Model and DINOv2 generate mask proposals and embeddings. LLaVA is responsible for perceiving the input image and question, and it outputs a special $<\text{SEG}>$ token to guide the mask selection module.
Figure 2: Model structure of LLM-Seg. The input image is processed by three modules. SAM generates binary mask proposals using its Everything Mode. The image encoder extracts features from the image. The vision-language model processes the image together with the input query and uses a special $<\text{SEG}>$ token to represent the result. A mask embedding is extracted for each mask proposal using mask pooling. Using the mask embeddings and the $<\text{SEG}>$ token, the fusion module and an MLP layer predict a score for each mask proposal. Finally, a simple threshold-based selection picks the mask proposals that form the final prediction.
Figure 3: Diagram of the fusion module and the two selection heads. The fusion module takes $K$ mask embeddings and one $<\text{SEG}>$ token as input. After the fusion module, the updated mask embeddings are processed by two separate MLP layers that predict different targets. The $256$-dimensional output is used to compute the loss of the IoU head, while the $1$-dimensional output is used for IoP regression.
Figure 4: The complete prompt template used to prompt ChatGPT-4. The example part is fixed for all queries; only the $<\text{summary}>$ and $<\text{important\_objects}>$ fields, which are highlighted in red, are replaced.
Figure 5: Visual comparison of LLM-Seg (ours) with the SOTA methods. Our LLM-Seg shows high-quality segmentation results even for multiple instances.
...and 1 more figures
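The Figure 2 caption mentions two mechanical steps: mask pooling, and threshold-based selection over per-proposal scores. A minimal sketch of both follows, in pure Python on nested lists. It makes several assumptions not confirmed by this summary: mask pooling is taken to be an average of feature vectors over the mask's foreground pixels, selected proposals are merged by pixel-wise union, the 0.5 threshold is arbitrary, and the scores are supplied by hand instead of coming from the fusion module and MLP. The function names are hypothetical.

```python
def mask_pool(features, mask):
    """Average the per-pixel feature vectors inside a binary mask.

    features: H x W x C nested lists; mask: H x W nested lists of 0/1.
    Returns a length-C list (all zeros if the mask is empty).
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total
    return [t / count for t in total]


def select_masks(masks, scores, threshold=0.5):
    """Keep proposals scoring at or above the threshold and merge them.

    Merging by union is an assumption made for this sketch.
    """
    height, width = len(masks[0]), len(masks[0][0])
    merged = [[0] * width for _ in range(height)]
    for mask, score in zip(masks, scores):
        if score >= threshold:
            for y in range(height):
                for x in range(width):
                    if mask[y][x]:
                        merged[y][x] = 1
    return merged


# Toy 2 x 2 feature map with 2 channels, and two mask proposals.
features = [[[1.0, 0.0], [3.0, 2.0]],
            [[5.0, 4.0], [7.0, 6.0]]]
mask_a = [[1, 1], [0, 0]]
mask_b = [[0, 0], [0, 1]]

embedding_a = mask_pool(features, mask_a)
print(embedding_a)  # [2.0, 1.0]

# Hand-written scores stand in for the model's predicted scores.
prediction = select_masks([mask_a, mask_b], scores=[0.9, 0.2])
print(prediction)  # [[1, 1], [0, 0]]
```

In the described model, the scores would be predicted from the pooled mask embeddings and the $<\text{SEG}>$ token; here they are fixed inputs so that the selection step can be read in isolation.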
