Empowering Segmentation Ability to Multi-modal Large Language Models

Yuqi Yang, Peng-Tao Jiang, Jing Wang, Hao Zhang, Kai Zhao, Jinwei Chen, Bo Li

TL;DR

This paper proposes a novel MLLM framework, coined LLaVASeg, which leverages a chain-of-thought prompting strategy to instruct the MLLMs to segment the target region queried by the user, and equips the MLLMs with strong reasoning segmentation ability.

Abstract

Multi-modal large language models (MLLMs) can understand image-language prompts and demonstrate impressive reasoning ability. In this paper, we extend the output of MLLMs by empowering them with segmentation ability. The extended MLLMs can both output language responses to image-language prompts and segment the regions that the complex question or query in the language prompt focuses on. To this end, the existing work LISA enlarges the original word embeddings with an additional segment token and fine-tunes dialogue generation and query-focused segmentation together, where the feature of the segment token is used to prompt the segment-anything model. Although LISA achieves superior segmentation performance, we observe that its dialogue ability decreases by a large margin compared to the original MLLMs. To maintain the original MLLMs' dialogue ability, we propose a novel MLLM framework, coined LLaVASeg, which leverages a chain-of-thought prompting strategy to instruct the MLLMs to segment the target region queried by the user. The MLLMs are first prompted to reason out a simple description of the target region from the complicated user query, and then to extract the visual attributes of the target region according to their understanding of the image. These visual attributes, such as color and relative location, are used to prompt the downstream segmentation model. Experiments show that the proposed method preserves the original dialogue ability and equips the MLLMs with strong reasoning segmentation ability. The code is available at https://github.com/YuqiYang213/LLaVASeg.
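
The two-step chain-of-thought prompting described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the prompt templates, the `ask_mllm` callable, the `SegmentationPrompt` container, and the canned `fake_mllm` stand-in are hypothetical and are not taken from the LLaVASeg code. The sketch only shows the order of the steps: first reduce the complicated query to a simple target description, then ask for visual attributes of that target, and hand both to a downstream segmentation model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt templates; the paper's actual wording is not given here.
REASON_PROMPT = (
    "Question: {query}\n"
    "Name the target region in the image with a short, simple description."
)
ATTRIBUTE_PROMPT = (
    "Target: {target}\n"
    "Describe its visual attributes, such as color and relative location."
)


@dataclass
class SegmentationPrompt:
    """Text that would be passed on to the downstream segmentation model."""
    target: str
    attributes: str


def build_segmentation_prompt(
    image: object,
    query: str,
    ask_mllm: Callable[[object, str], str],
) -> SegmentationPrompt:
    """Run the two prompting steps against a frozen MLLM (assumed interface)."""
    # Step 1: reason from the complicated query to a simple target description.
    target = ask_mllm(image, REASON_PROMPT.format(query=query)).strip()
    # Step 2: extract visual attributes of that target from the image.
    attributes = ask_mllm(image, ATTRIBUTE_PROMPT.format(target=target)).strip()
    return SegmentationPrompt(target=target, attributes=attributes)


def fake_mllm(image: object, prompt: str) -> str:
    """Canned stand-in for an MLLM so the sketch runs without any model."""
    if prompt.startswith("Question:"):
        return "the orange"
    return "orange color, round, on the left side of the plate"


if __name__ == "__main__":
    result = build_segmentation_prompt(
        image=None,
        query="Which food on the table is rich in vitamin C?",
        ask_mllm=fake_mllm,
    )
    print(result)
```

Because the MLLM is only queried through prompts in this sketch, its weights are never updated, which mirrors the abstract's stated motivation of keeping the original dialogue ability intact.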

Paper Structure

This paper contains 16 sections, 5 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Comparison of the previous fine-tuning method, LISA (lai2023lisa), and our method. Although previous works equip MLLMs with segmentation ability, they degrade the MLLMs' original reasoning ability. In contrast, our method not only equips the MLLMs with segmentation ability but also maintains their original reasoning ability by freezing their parameters.
  • Figure 2: Overall pipeline of our LLaVASeg. Given the user query and image, the MLLMs generate the visual attributes through chain-of-thought prompting. Then, we input these visual attributes into the segmentation model to perform multi-scale prompting and generate the segmentation mask. During training, the MLLMs and the segmentation model are frozen, and only the lightweight adapters are trainable to guarantee efficiency.
  • Figure 3: A failure case study for LLaVASeg and LISA. The upper left corner shows that LLaVA fails to infer the correct answer from the user query.
  • Figure 4: Visualization of the results of our pipeline. We also show one failure case caused by a mistake in the reasoning step to give a deeper understanding of our method.