Table of Contents
Fetching ...

MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation

Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, Dae-Shik Kim

TL;DR

The paper tackles the limitation of existing reasoning segmentation datasets that focus on single-target object-level reasoning by introducing MMR, a large-scale dataset with 194K complex, implicit Q&A pairs covering multi-target and part-level reasoning. It couples this dataset with M^2SA, a baseline model that fuses early local features and employs multiple [SEG] tokens to enable simultaneous multi-target and fine-grained segmentation, built on SAM and a Multimodal LLM. Experimental results demonstrate that M^2SA achieves competitive or superior performance on MMR and related referring-expression segmentation tasks, validating the effectiveness of its architectural choices and data design. The work advances human-AI interaction in vision-language tasks by supporting context-aware, multi-granularity segmentation, with potential impact on real-world robotics and interactive systems. The dataset and model together offer a foundation for more versatile reasoning segmentation in open-world settings, encouraging further research on efficiency and broader part-level coverage.

Abstract

The fusion of Large Language Models with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. However, seamless human-AI interaction demands more than just object-level recognition; it requires understanding both objects and the functions of their detailed parts, particularly in multi-target scenarios. For example, when instructing a robot to \textit{turn on the TV"}, there could be various ways to accomplish this command. Recognizing multiple objects capable of turning on the TV, such as the TV itself or a remote control (multi-target), provides more flexible options and aids in finding the optimized scenario. Furthermore, understanding specific parts of these objects, like the TV's button or the remote's button (part-level), is important for completing the action. Unfortunately, current reasoning segmentation datasets predominantly focus on a single target object-level reasoning, which limits the detailed recognition of an object's parts in multi-target contexts. To address this gap, we construct a large-scale dataset called Multi-target and Multi-granularity Reasoning (MMR). MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects, based on pre-existing image-mask sets. This dataset supports diverse and context-aware interactions by hierarchically providing object and part information. Moreover, we propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Experimental results on MMR show that the proposed method can reason effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement.

MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation

TL;DR

The paper tackles the limitation of existing reasoning segmentation datasets that focus on single-target object-level reasoning by introducing MMR, a large-scale dataset with 194K complex, implicit Q&A pairs covering multi-target and part-level reasoning. It couples this dataset with M^2SA, a baseline model that fuses early local features and employs multiple [SEG] tokens to enable simultaneous multi-target and fine-grained segmentation, built on SAM and a Multimodal LLM. Experimental results demonstrate that M^2SA achieves competitive or superior performance on MMR and related referring-expression segmentation tasks, validating the effectiveness of its architectural choices and data design. The work advances human-AI interaction in vision-language tasks by supporting context-aware, multi-granularity segmentation, with potential impact on real-world robotics and interactive systems. The dataset and model together offer a foundation for more versatile reasoning segmentation in open-world settings, encouraging further research on efficiency and broader part-level coverage.

Abstract

The fusion of Large Language Models with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. However, seamless human-AI interaction demands more than just object-level recognition; it requires understanding both objects and the functions of their detailed parts, particularly in multi-target scenarios. For example, when instructing a robot to \textit{turn on the TV"}, there could be various ways to accomplish this command. Recognizing multiple objects capable of turning on the TV, such as the TV itself or a remote control (multi-target), provides more flexible options and aids in finding the optimized scenario. Furthermore, understanding specific parts of these objects, like the TV's button or the remote's button (part-level), is important for completing the action. Unfortunately, current reasoning segmentation datasets predominantly focus on a single target object-level reasoning, which limits the detailed recognition of an object's parts in multi-target contexts. To address this gap, we construct a large-scale dataset called Multi-target and Multi-granularity Reasoning (MMR). MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects, based on pre-existing image-mask sets. This dataset supports diverse and context-aware interactions by hierarchically providing object and part information. Moreover, we propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Experimental results on MMR show that the proposed method can reason effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement.

Paper Structure

This paper contains 41 sections, 1 equation, 15 figures, 11 tables.

Figures (15)

  • Figure 1: The prompt used in our data creation process with GPT-4V.
  • Figure 2: An example from the MMR dataset generated through our data creation process. The left and right pictures show the object- and part-level segmentation masks, respectively.
  • Figure 3: Statistics of the proposed MMR dataset. (a) the word cloud for the object categories, (b) the number of objects per each object category in questions (log scale), (c) the word cloud for the part categories, (d) the number of parts per each part category in questions (log scale), (e) the distribution of target count in answers, and (f) the total number of expressions of objects and parts.
  • Figure 4: The overview of M$^{2}$SA framework.
  • Figure 5: To generate question-answer pairs in MMR dataset, we use gpt-4-vision-preview model. For the hyper-parameters, we set the temperature to 0.7 and max_tokens to 850.
  • ...and 10 more figures