DM2RM: Dual-Mode Multimodal Ranking for Target Objects and Receptacles Based on Open-Vocabulary Instructions

Ryosuke Korekata; Kanta Kaneda; Shunya Nagashima; Yuto Imai; Komei Sugiura

TL;DR

This work introduces DM2RM, a dual-mode multimodal ranking model that retrieves images of both target objects and receptacles from pre-collected indoor images using open-vocabulary instructions. It employs three novel modules (Task Paraphraser for standardizing instructions, Switching Phrase Encoder for mode-conditioned embedding, and Segment Anything Region Encoder for shape-aware visual features) so that a single model can perform both retrievals and guide fetch-and-carry actions. Evaluation on the LTRRIE-FC dataset and real-world physical tests shows significant improvements over baselines in unseen environments and robust zero-shot transfer, with a task success rate of 82% in real robot trials. These results demonstrate the practicality of open-vocabulary, image-retrieval-driven planning for mobile manipulation and pave the way for multi-object and richer instruction grounding in domestic service robots.

Abstract

In this study, we aim to develop a domestic service robot (DSR) that, guided by open-vocabulary instructions, can carry everyday objects to the specified pieces of furniture. Few existing methods handle mobile manipulation tasks with open-vocabulary instructions in the image retrieval setting, and most do not identify both the target objects and the receptacles. We propose the Dual-Mode Multimodal Ranking model (DM2RM), which enables images of both the target objects and receptacles to be retrieved using a single model based on multimodal foundation models. We introduce a switching mechanism that leverages a mode token and phrase identification via a large language model to switch the embedding space based on the prediction target. To evaluate the DM2RM, we construct a novel dataset including real-world images collected from hundreds of building-scale environments and crowd-sourced instructions with referring expressions. The evaluation results show that the proposed DM2RM outperforms previous approaches in terms of standard metrics in image retrieval settings. Furthermore, we demonstrate the application of the DM2RM on a standardized real-world DSR platform including fetch-and-carry actions, where it achieves a task success rate of 82% despite the zero-shot transfer setting. Demonstration videos, code, and more materials are available at https://kkrr10.github.io/dm2rm/.
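
To make the dual-mode idea concrete, the following is a minimal toy sketch of ranking the same image pool twice, once per mode, by cosine similarity. It is not the authors' implementation: the embeddings are hand-written 3-dimensional vectors, the image names are hypothetical, and the learned switching mechanism (mode token plus LLM-based phrase identification) is reduced here to a dictionary lookup keyed by the mode.

```python
import math

# Hypothetical image embeddings; a real system would obtain these
# from a visual encoder applied to the pre-collected images.
IMAGE_EMBEDDINGS = {
    "mug_on_desk": [0.9, 0.1, 0.0],
    "towel_in_bathroom": [0.1, 0.9, 0.0],
    "kitchen_shelf": [0.0, 0.2, 0.9],
}

# Hypothetical phrase embeddings for one instruction such as
# "Carry the mug to the kitchen shelf". The mode selects which
# phrase of the instruction drives the query (an assumption that
# stands in for the learned mode-conditioned text encoder).
PHRASE_EMBEDDINGS = {
    "target": [1.0, 0.0, 0.1],      # stands for "the mug"
    "receptacle": [0.0, 0.1, 1.0],  # stands for "the kitchen shelf"
}


def cosine_similarity(u, v):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_images(mode):
    """Rank all images for the given mode, most similar first."""
    if mode not in PHRASE_EMBEDDINGS:
        raise ValueError(f"unknown mode: {mode!r}")
    query = PHRASE_EMBEDDINGS[mode]
    return sorted(
        IMAGE_EMBEDDINGS,
        key=lambda name: cosine_similarity(query, IMAGE_EMBEDDINGS[name]),
        reverse=True,
    )


target_ranking = rank_images("target")
receptacle_ranking = rank_images("receptacle")
```

In this toy, one ranking function serves both retrieval problems and only the mode changes the query, which mirrors the single-model design described in the abstract at a very coarse level.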

Paper Structure (24 sections, 6 equations, 8 figures, 7 tables)

Introduction
Related Work
Language-Guided Embodied AI
Multimodal Language Understanding
Problem Statement
Proposed Method
Input
Task Paraphraser
Switching Phrase Encoder
Segment Anything Region Encoder
Experiments
Dataset
Parameter Settings
Quantitative Results
Qualitative Results
...and 9 more sections

Figures (8)

Figure 1: Overview of our method. First, the DSR collects images of the environment through pre-exploration. Given the open-vocabulary instruction, it is required to retrieve the red and green framed images as the target object image and receptacle image from the collected images, respectively. Subsequently, the DSR carries the target object to the receptacle, based on the user-selected images.
Figure 2: Architecture of the DM$^2$RM. 'MLP,' 'Sim,' and '$\oplus$' represent the multi-layer perceptron, cosine similarity, and concatenation, respectively.
Figure 3: Annotation interface. Annotators were required to give instructions for the DSR to carry the target object (a red bounding box) to the receptacle (a green bounding box). These instructions were input in the text box below the images.
Figure 4: Qualitative comparison between our method and a baseline method [kaneda2024learning]. For each sample, $\bm{x}_\mathrm{txt}$ and/or $\bm{x}^{\prime}_\mathrm{txt}$, the top-3 retrieved images, and the GT image are shown. The results regarding the target object and receptacle are shown on the left (*-a) and right (*-b), respectively. The target object images and receptacle images are highlighted in the red and green frames, respectively. The words underlined in red, green, and black indicate $\bm{x}_\mathrm{targ}$, $\bm{x}_\mathrm{rec}$, and grammatical errors, respectively.
Figure 5: A failure sample on the HM3D-FC test set. Rows (a) and (b) show the qualitative results in the target and receptacle modes, respectively. From left to right: GT images and top-3 retrieved images. The words highlighted in red and green indicate $\bm{x}_\mathrm{targ}$ and $\bm{x}_\mathrm{rec}$, respectively.
...and 3 more figures
