Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

Takayuki Nishimura, Katsuyuki Kuyo, Motonari Kambara, Komei Sugiura

TL;DR

This work tackles the OSIM-3D problem of generating pixel-accurate segmentation masks from open-vocabulary manipulation instructions, and addresses it with a multi-module framework. It introduces a Polygon Matching Loss based on optimal transport, which makes the loss invariant to vertex order, and an Open-Vocabulary 3D Aggregator that reasons about objects beyond the camera view. The method combines LLM-based paraphrasing, visual-context descriptions, and cross-modal fusion (SBAE) with a transformer-based vertex predictor. It reaches a mean IoU of up to 38.16% on the SHIMRIE-3D dataset, and detailed ablations support the contribution of each module. The work advances robust open-vocabulary segmentation for domestic robots and points to semantic-labeling approaches as a future way to further reduce misgrounding and ambiguity in rich indoor scenes.

Abstract

We consider the task of generating segmentation masks for the target object from an object manipulation instruction, which allows users to give open-vocabulary instructions to domestic service robots. Conventional segmentation generation approaches often fail to account for objects outside the camera's field of view and for cases in which the order of vertices differs but still represents the same polygon, which leads to erroneous mask generation. In this study, we propose a novel method that generates segmentation masks from open-vocabulary instructions. We implement a novel loss function using optimal transport to avoid a large loss when the order of vertices differs but still represents the same polygon. To evaluate our approach, we constructed a new dataset based on the REVERIE and Matterport3D datasets. The results demonstrated the effectiveness of the proposed method compared with existing mask generation methods. Remarkably, our best model achieved a +16.32% improvement on the dataset compared with a representative polygon-based method.
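
The vertex-order issue described in the abstract can be illustrated with a minimal sketch. It is not the authors' Polygon Matching Loss: the abstract does not specify the cost function or the solver, so the L1 cost, the uniform vertex masses, and the function names `ordered_loss` and `matching_loss` are assumptions made for illustration. With equal vertex counts and uniform masses, an optimal transport plan is a one-to-one matching, so for tiny polygons it can be found by brute force over permutations.

```python
from itertools import permutations


def l1(p, q):
    """L1 distance between two 2D vertices (assumed cost, for illustration)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def ordered_loss(pred, target):
    """Index-by-index loss: compares vertex i with vertex i, so it depends on order."""
    return sum(l1(p, q) for p, q in zip(pred, target)) / len(pred)


def matching_loss(pred, target):
    """Mean cost of the cheapest one-to-one vertex matching.

    For uniform masses and equal vertex counts this equals the optimal
    transport cost. Brute force is O(n!), so it only suits tiny polygons.
    """
    n = len(pred)
    cost = [[l1(p, q) for q in target] for p in pred]
    best = min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )
    return best / n


# The same unit square, listed from two different starting vertices.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
rotated = [(1, 1), (0, 1), (0, 0), (1, 0)]

print(ordered_loss(rotated, square))   # penalises the reordering
print(matching_loss(rotated, square))  # ignores the reordering
```

Here the index-by-index loss is large even though both vertex lists describe the same polygon, while the matching-based loss is zero. A practical implementation would replace the brute-force search with a polynomial-time assignment or Sinkhorn-style solver.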

Paper Structure (18 sections, 3 equations, 5 figures, 3 tables)

This paper contains 18 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of our method. Our method generates a polygon-based segmentation mask for the target object of a given instruction and image. We introduce the Polygon Matching Loss. The LLM Paraphraser, SBAE, OVA, VCI, and OTVP are explained in the Methods section.
  • Figure 2: Proposed method framework. The proposed method consists of five main modules: LLM Paraphraser, SBAE, OVA, Visual Context Interpreter (VCI), and OTVP. $C\left(\cdot, \cdot\right)$, SAM, and OpenScene represent the cost function, the Segment Anything Model [Kirillov_2023_ICCV], and OpenScene [Peng2023OpenScene], respectively.
  • Figure 3: Structure of the SBAE. The SBAE enhances the understanding of object segment information and fuses visual and linguistic features. CA, SAM, and FFN represent cross-attention, the Segment Anything Model [Kirillov_2023_ICCV], and a feed-forward network, respectively.
  • Figure 4: Qualitative results of successful and failure cases. (i) and (ii) show successful examples, and (iii) shows a failure example. The instructions for (i), (ii) and (iii) were as follows: "In the 3rd level bathroom, there is a box of tissues to the left of the basin. Please fetch them"; "Walk to the living room and fetch the leftmost pillow on the smaller white sofa, closest to the plant on the small table." and "Go to the closet in the bedroom with the orange comforter and bring me the second hanger from the top."
  • Figure 5: The instruction sentence for this example was "Go to the bathroom on level 1 and bring me the picture furthest to the left." In this case, the mask generated by Model (v) was slightly skewed toward the sink.