Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

Li Zhou, Xu Yuan, Zenghui Sun, Zikun Zhou, Jingsong Lan

TL;DR

A Multi-Granularity Large Multimodal Model (MGLMM) is introduced that adjusts the granularity of Segmentation and Captioning (SegCap) according to user instructions, from panoptic SegCap to fine-grained SegCap. A novel unified SegCap data format is also proposed to unify heterogeneous segmentation datasets.

Abstract

Large Multimodal Models (LMMs) have achieved significant progress by extending large language models. Building on this progress, the latest developments in LMMs demonstrate the ability to generate dense pixel-wise segmentation through the integration of segmentation models. Despite these innovations, the textual responses and segmentation masks of existing works remain at the instance level, showing limited ability to perform fine-grained understanding and segmentation even when provided with detailed textual cues. To overcome this limitation, we introduce a Multi-Granularity Large Multimodal Model (MGLMM), which is capable of seamlessly adjusting the granularity of Segmentation and Captioning (SegCap) following user instructions, from panoptic SegCap to fine-grained SegCap. We name this new task Multi-Granularity Segmentation and Captioning (MGSC). Observing the lack of a benchmark for model training and evaluation on the MGSC task, we establish a benchmark with aligned masks and captions at multiple granularities using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research. In addition, we propose a novel unified SegCap data format to unify heterogeneous segmentation datasets; it facilitates learning to associate object concepts with visual features during multi-task training. Extensive experiments demonstrate that MGLMM handles more than eight downstream tasks and achieves state-of-the-art performance in MGSC, GCG, image captioning, referring segmentation, multiple and empty segmentation, and reasoning segmentation tasks. The strong performance and versatility of MGLMM underscore its potential impact on advancing multimodal research.

Paper Structure (13 sections, 3 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: MGLMM is a versatile and sophisticated LMM, which can handle various tasks involving textual and pixel-level mask responses. We show its visualization results in the following scenarios: multi-granularity segmentation and captioning, referring segmentation, multiple/empty segmentation, panoptic segmentation, reasoning segmentation, image-level captioning, and conversation.
  • Figure 2: Qualitative comparison of GLaMM and our MGLMM. Please refer to Appendix A for more details.
  • Figure 3: Left: The model architecture of MGLMM. Right: The proposed unified data format for multi-task learning.
  • Figure 4: Overview of our proposed automated data annotation pipeline. Due to space limitations, the detailed caption is not shown in the figure; please refer to Appendix B for the detailed version. Best viewed zoomed in.