InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning

Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, Katia Sycara

TL;DR

This work introduces InstructPart, a real-world benchmark for task-oriented part segmentation that pairs 2,400 everyday-object images with 9,600 task instructions and 2,400 part queries, across 48 object classes and 44 part classes. It defines two tasks, TRPS and ORPS, to evaluate how well Vision-Language Models (VLMs) ground language instructions in object parts, and systematically benchmarks state-of-the-art VLMs, revealing substantial gaps in fine-grained part reasoning. The authors propose PISA, a simple yet effective baseline built on a frozen DINOv2 backbone and a SAM-based decoder, which achieves substantial gains after fine-tuning on InstructPart, showing that the dataset is also valuable as training data. The dataset supports robust evaluation for robotics and manipulation and highlights the need for improved part-level grounding in foundation models to enable practical, real-world tasks.
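
To make the dataset composition concrete, the following is a minimal Python sketch of how one annotated sample and the object-part pair statistics could be represented. The class name `PartSample`, its field names, the part-query template, and the toy entries are all assumptions made for illustration; they are not the released dataset schema or actual dataset content. Only the totals (2,400 images, 9,600 instructions) come from the summary above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

# Totals quoted in the TL;DR; the ratio is an average, not a guaranteed per-image count.
NUM_IMAGES = 2400
NUM_INSTRUCTIONS = 9600
AVG_INSTRUCTIONS_PER_IMAGE = NUM_INSTRUCTIONS / NUM_IMAGES


@dataclass
class PartSample:
    """Hypothetical record for one image; field names are illustrative only."""
    image_path: str
    object_name: str
    part_name: str
    instructions: List[str] = field(default_factory=list)

    @property
    def part_query(self) -> str:
        # Assumed template for an explicit part query (object and part both named).
        return f"{self.part_name} of the {self.object_name}"


def pair_distribution(samples: List[PartSample]) -> Counter:
    """Count how often each (object, part) pair occurs, as in a pair-frequency chart."""
    return Counter((s.object_name, s.part_name) for s in samples)


# Toy entries invented for this sketch; they are not taken from InstructPart.
samples = [
    PartSample("img_0001.jpg", "kettle", "handle", ["Pick up the kettle."]),
    PartSample("img_0002.jpg", "kettle", "handle", ["Carry the kettle to the table."]),
    PartSample("img_0003.jpg", "kettle", "spout", ["Pour some water into the cup."]),
]

pairs: Counter = pair_distribution(samples)
most_common_pair: Tuple[str, str] = pairs.most_common(1)[0][0]
```

Keeping the object and part names as separate fields lets the same record serve both an instruction-driven query and an explicit part query, which is one plausible way to support two evaluation settings from a single annotation.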

Abstract

Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.

Paper Structure

This paper contains 30 sections, 14 figures, 8 tables.

Figures (14)

  • Figure 1: The task-oriented part segmentation task: Presented with an image observation (left) and a corresponding task to add some water, the system is required to reason about specific parts to fulfill the task.
  • Figure 2: Examples from our InstructPart dataset are illustrated as follows: instruction queries are denoted in red text, while object and part names are indicated in blue. Each example includes an observation image (left), with the corresponding ground truth part segments (right), highlighted with a green mask.
  • Figure 3: Object-part pair distribution. We collect 2,400 data pieces in total, containing 48 object classes and 44 part classes, constituting 98 different object-part pair classes. The x-axis shows the name of the object-part pairs, and the y-axis shows the frequency of each item. The parts belonging to the same object classes are highlighted with the same color in the bar chart.
  • Figure A4: InstructPart dataset object and part classes. The left part shows the object class names and the right part shows the part class names.
  • Figure A5: InstructPart dataset affordance and action categories. The left part shows the affordance names and the right part shows the action names. Specifically, affordances refer to low-level actions performed on a specific part, while actions refer to the high-level function to be achieved.
  • ...and 9 more figures