InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Zheng Zhao, Yujiu Yang

TL;DR

Problem: unify text-guided segmentation tasks across image and video domains into a single framework. Approach: InstructSeg integrates an Object-aware Video Perceiver and Vision-guided Multi-granularity Text Fusion, with a LoRA-tuned LLM and a segmentation decoder, trained end-to-end on IVS datasets. Contributions: unified IVS formulation, novel OVP and VMTF modules enabling temporal/object-aware perception and detailed language grounding, and state-of-the-art results across image and video IVS benchmarks with a compact 3B backbone. Significance: simplifies cross-domain vision-language segmentation and enables scalable end-to-end optimization for practical deployment.

Abstract

Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal segmentation models for the image and video domains have made rapid progress recently. However, these methods are often developed separately for specific domains, overlooking the similarities in task settings and solutions across these two areas. In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). Correspondingly, we propose InstructSeg, an end-to-end segmentation pipeline equipped with MLLMs for IVS. Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. Additionally, we introduce vision-guided multi-granularity text fusion to better integrate global and detailed text information with fine-grained visual guidance. By leveraging multi-task and end-to-end training, InstructSeg demonstrates superior performance across diverse image and video segmentation tasks, surpassing both segmentation specialists and MLLM-based methods with a single model. Our code is available at https://github.com/congvvc/InstructSeg.
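The IVS definition above groups four tasks along two axes: image versus video input, and referring versus reasoning instructions. As a rough illustration only, the sketch below encodes that grouping as a single sample type. The names `IVSTask` and `IVSSample`, and the choice to represent frames as file paths, are hypothetical and not taken from the InstructSeg codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class IVSTask(Enum):
    """The four text-guided segmentation tasks grouped under IVS."""
    RES = "referring expression segmentation"
    REASON_SEG = "reasoning segmentation"
    R_VOS = "referring video object segmentation"
    REASON_VOS = "reasoning video object segmentation"


@dataclass(frozen=True)
class IVSSample:
    """Hypothetical unified sample: visual input plus a text instruction."""
    task: IVSTask
    frames: List[str]   # assumed representation: paths to image/video frames
    instruction: str    # referring expression or reasoning query

    @property
    def is_video(self) -> bool:
        # Video-level tasks operate on a sequence of frames.
        return self.task in (IVSTask.R_VOS, IVSTask.REASON_VOS)

    @property
    def needs_reasoning(self) -> bool:
        # Reasoning tasks use implicit queries rather than direct descriptions.
        return self.task in (IVSTask.REASON_SEG, IVSTask.REASON_VOS)


sample = IVSSample(
    task=IVSTask.R_VOS,
    frames=["frame_000.jpg", "frame_001.jpg"],
    instruction="the dog running toward the camera",
)
print(sample.is_video, sample.needs_reasoning)
```

Keeping the task label separate from the payload means all four tasks can share one data path, which is the kind of unification the abstract argues for; how InstructSeg actually structures its training samples may differ.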

Paper Structure

This paper contains 18 sections, 4 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: We define Instructed Visual Segmentation (IVS) as the union of four text-guided segmentation tasks across image and video domains: referring expression segmentation (RES), reasoning segmentation (ReasonSeg), referring video object segmentation (R-VOS) and reasoning video object segmentation (ReasonVOS). InstructSeg can handle all the IVS tasks in one model with excellent performance.
  • Figure 2: Framework of InstructSeg. InstructSeg tackles Instructed Visual Segmentation tasks in an end-to-end pipeline. For challenging video analysis tasks, we employ the object-aware video perceiver to effectively extract both temporal and object-specific information from the reference frames. In addition, vision-guided multi-granularity text fusion, applied to the detailed text embeddings, gives InstructSeg comprehensive and accurate vision-language perception and understanding. Finally, the mask embeddings and multi-granularity text embeddings are decoded into segmentation masks and scores.
  • Figure 3: Illustration of Object-aware Video Perceiver (OVP). OVP learns temporal and object information with $N_1$ perceiver layers through the interactions of text and reference frames along with the learnable queries.
  • Figure 4: The structure of the Vision-guided Multi-granularity Text Fusion (VMTF) module.
  • Figure 5: The structure of the Segmentation Decoder module. Following Mask2Former [cheng2022masked], we adopt the pixel decoder and transformer decoder to extract pixel-level visual information and instance-level object information. Unlike that design, we compute the similarity between mask embeddings and multi-grained text embeddings and use it as the mask score for selecting mask proposals.
  • ...and 3 more figures
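The Figure 5 caption describes scoring each mask proposal by the similarity between its mask embedding and the multi-granularity text embeddings. The following is a minimal sketch of that idea, not the authors' implementation. It assumes cosine similarity as the similarity measure and a plain mean over the text embeddings as the aggregation; the caption specifies neither, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_mask_proposals(
    mask_embeddings: Sequence[Vector],
    text_embeddings: Sequence[Vector],
) -> List[float]:
    """Score each mask embedding by its mean similarity to the text embeddings.

    Assumption: mean aggregation; the source only says "similarity".
    """
    if not text_embeddings:
        raise ValueError("at least one text embedding is required")
    return [
        sum(cosine_similarity(m, t) for t in text_embeddings) / len(text_embeddings)
        for m in mask_embeddings
    ]


def select_best_proposal(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring mask proposal."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy example: three mask proposals, two text embeddings (global + detailed).
mask_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embeddings = [[1.0, 0.0], [0.9, 0.1]]
scores = score_mask_proposals(mask_embeddings, text_embeddings)
best = select_best_proposal(scores)
print(best, [round(s, 3) for s in scores])
```

In the toy example the first proposal points in nearly the same direction as both text embeddings, so it receives the highest score and is selected. A real model would compute these scores on learned embeddings, typically as a batched tensor operation.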