InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

Cong Wei; Yujie Zhong; Haoxian Tan; Yingsen Zeng; Yong Liu; Zheng Zhao; Yujiu Yang

TL;DR

Problem: unify text-guided segmentation tasks across the image and video domains in a single framework. Approach: InstructSeg combines an Object-aware Video Perceiver (OVP) and Vision-guided Multi-granularity Text Fusion (VMTF) with a LoRA-tuned LLM and a segmentation decoder, trained end-to-end on IVS datasets. Contributions: a unified Instructed Visual Segmentation (IVS) formulation; the OVP and VMTF modules, which provide temporal and object-aware perception and detailed language grounding; and state-of-the-art results across image and video IVS benchmarks with a compact 3B backbone. Significance: simplifies cross-domain vision-language segmentation and enables scalable end-to-end optimization for practical deployment.

Abstract

Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal segmentation models for the image and video domains have made rapid progress recently. However, these methods are often developed separately for specific domains, overlooking the similarities in task settings and solutions across these two areas. In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). Correspondingly, we propose InstructSeg, an end-to-end segmentation pipeline equipped with MLLMs for IVS. Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. Additionally, we introduce vision-guided multi-granularity text fusion to better integrate global and detailed text information with fine-grained visual guidance. By leveraging multi-task and end-to-end training, InstructSeg demonstrates superior performance across diverse image and video segmentation tasks, surpassing both segmentation specialists and MLLM-based methods with a single model. Our code is available at https://github.com/congvvc/InstructSeg.
