InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning
Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, Katia Sycara
TL;DR
This work introduces InstructPart, a real-world benchmark for task-oriented part segmentation that pairs 2,400 images of everyday objects with 9,600 task instructions and 2,400 part queries, covering 48 object classes and 44 part classes. It defines two tasks, TRPS and ORPS, to evaluate the language-grounding and part-grounding capabilities of Vision-Language Models (VLMs), and it systematically benchmarks state-of-the-art VLMs, revealing substantial gaps in fine-grained part reasoning. The authors also propose PISA, a simple baseline that combines a frozen DINOv2 backbone with a SAM-based decoder; fine-tuning on InstructPart yields a twofold performance improvement, which indicates the dataset's value as training data. The dataset supports evaluation for robotics and manipulation and highlights the need for better part-level grounding in foundation models before they can handle practical, real-world tasks.
Abstract
Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.
