LaSagnA: Language-based Segmentation Assistant for Complex Queries

Cong Wei, Haoxian Tan, Yujie Zhong, Yujiu Yang, Lin Ma

TL;DR

This work addresses two limitations of large language models for vision (vLLMs) on complex queries: handling multiple arbitrary targets in a single query, and recognizing when a queried category is absent from the image. It introduces LaSagnA, a vLLM-based segmentation assistant that uses a general sequence format for complex queries and adds semantic segmentation to training so that the model learns multi-target and open-set reasoning. Three training strategies (sequence augmentation, random class sampling, and target-order consistency), coupled with a joint objective over text and mask losses, let the model use segmentation datasets effectively. Empirically, LaSagnA achieves results comparable to conventional methods on closed-set and open-set semantic segmentation and outperforms several vLLMs on referring and reasoning segmentation, which supports combining segmentation supervision with language-based queries for perception tasks.

Abstract

Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outcomes, including bounding boxes and masks. Nonetheless, two constraints restrict the further application of these vLLMs: the inability to handle multiple targets per query and the failure to identify the absence of query objects in the image. In this study, we identify the insufficient complexity of training queries as the main cause of these problems. Consequently, we define a general sequence format for complex queries. We then incorporate a semantic segmentation task into the current pipeline to meet the resulting training data requirements. Furthermore, we present three novel strategies to handle the challenges that arise from directly integrating the proposed format. The effectiveness of our model in processing complex queries is validated by results comparable to conventional methods on both closed-set and open-set semantic segmentation datasets. Additionally, we outperform a series of vLLMs in reasoning and referring segmentation. We release the code at https://github.com/congvvc/LaSagnA.
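
To make the idea of a complex query concrete, the following is a minimal Python sketch of how one training pair might be built from a semantic segmentation label set. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the `[SEG]` placeholder token, the "none present" reply, and the names `build_pair` and `max_classes` are all hypothetical. It illustrates only two of the three strategies named above, random class sampling (query a bounded subset of the vocabulary, mixing present and absent categories) and target-order consistency (answer in the same order as the query); sequence augmentation is not shown.

```python
import random

SEG_TOKEN = "[SEG]"  # hypothetical placeholder: one mask is decoded per occurrence


def build_pair(present, vocabulary, max_classes, rng):
    """Return (queried, query, response) for one image.

    present:     categories that actually appear in the image
    vocabulary:  full category list of the dataset
    max_classes: upper bound on the number of categories in the query
    """
    present_set = set(present)
    positives = [c for c in vocabulary if c in present_set]
    negatives = [c for c in vocabulary if c not in present_set]

    # Random class sampling (assumed form): cap the query length, then
    # pad with absent categories so the model also sees negatives.
    positives = rng.sample(positives, min(len(positives), max_classes))
    n_neg = min(len(negatives), max_classes - len(positives))
    queried = positives + rng.sample(negatives, n_neg)
    rng.shuffle(queried)

    # Hypothetical prompt wording.
    query = ("Please segment these categories if they appear in the image: "
             + ", ".join(queried) + ".")

    # Target-order consistency (assumed form): answer only the present
    # categories, in the same order in which the query lists them.
    answered = [c for c in queried if c in present_set]
    if answered:
        response = ", ".join(f"{c} {SEG_TOKEN}" for c in answered)
    else:
        response = "None of the queried categories appear in the image."
    return queried, query, response


if __name__ == "__main__":
    vocab = ["wall", "floor", "bed", "window", "cabinet", "lamp", "chair", "table"]
    _, q, r = build_pair(["bed", "lamp"], vocab, max_classes=5, rng=random.Random(0))
    print(q)
    print(r)
```

In this sketch a single query covers several targets at once, and absent categories such as window or cabinet simply do not appear in the response, which is the behavior the abstract contrasts with one-target-per-query models.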

Paper Structure

This paper contains 14 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison between LISA (Lai et al., 2023) and LaSagnA on a complex query. LISA requires calling the model $N$ times to obtain the final result (where $N$ is the number of targets in the query), and it fails to identify non-existent categories such as window and cabinet. In contrast, LaSagnA can handle novel categories and accurately identify the existing targets in the image based on a single query.
  • Figure 2: Three problems that arise when handling semantic segmentation tasks. (a) Lengthy input caused by the large number of categories in the dataset. (b) Low recall caused by incomplete sequence predictions. (c) Inconsistent category names between queries and responses.
  • Figure 3: Overview of LaSagnA. The vLLM generates a text response based on the instruction text and the input image. The vision encoder and decoder together form a standard SAM (Kirillov et al., 2023), which is trained to predict a mask based on the textual embedding generated by the vLLM. We finetune only the vLLM, using LoRA, and train the SAM decoder.
  • Figure 4: Qualitative results of LaSagnA’s performance on complex queries and single object scenarios.
  • Figure 5: Qualitative results of LaSagnA’s performance on reasoning segmentation.