Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations

Miyu Goko; Motonari Kambara; Daichi Saito; Seitaro Otsuki; Komei Sugiura

TL;DR

The paper tackles task success prediction for open-vocabulary manipulation by predicting a probability $P(\hat{y}=1)$ from an instruction and before/after egocentric images. It introduces Contrastive $\lambda$-Repformer, which builds a multi-level visual-language representation $h_{\lambda}$ by fusing Scene, Aligned, and Narrative representations and uses a CrossAttn-based decoder to align image differences with the instruction. The core contributions are the $\lambda$-Representation Encoder, the Contrastive $\lambda$-Representation Decoder, and extensive empirical validation on SP-RT-1 (and SP-HSR in the physical setting), where the proposed method outperforms representative MLLMs and baselines, improving accuracy by up to $8.66$ percentage points. The work advances reliable SPOM by enabling fine-grained object and relational understanding in open-vocabulary tasks, with potential impact on robustness and safety in real-world robotic manipulation.

Abstract

In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed characteristics of objects and/or subtle changes in the position of objects. We propose Contrastive $\lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates the following three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on important changes by looking at the differences in the representation between two images. We evaluate Contrastive $\lambda$-Repformer on a dataset based on a large-scale standard dataset, the RT-1 dataset, and on a physical robot platform. The results show that our approach outperformed existing approaches including MLLMs. Our best model achieved an improvement of 8.66 points in accuracy compared to the representative MLLM-based model.
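
The pipeline described above (fuse three feature types per image, take the difference between the two images, align it with the instruction, output a success probability) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature extractors are replaced by random arrays, the fusion is a plain concatenation along the token axis, the attention is single-head without learned projections, and the names `lambda_representation`, `cross_attention`, and `predict_success`, as well as all shapes and the choice of the image difference as the attention query, are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def lambda_representation(scene, aligned, narrative):
    """Fuse three (tokens, dim) feature sets by concatenating along the token axis.

    Assumption: the paper's actual fusion may differ; concatenation is a stand-in.
    """
    return np.concatenate([scene, aligned, narrative], axis=0)


def cross_attention(query, context):
    """Single-head scaled dot-product attention with no learned projections."""
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d), axis=-1)
    return weights @ context


def predict_success(h_before, h_after, instruction, w, b):
    """Return a probability in (0, 1) from the image difference and the instruction."""
    diff = h_after - h_before                      # what changed between the images
    attended = cross_attention(diff, instruction)  # align the change with the text
    pooled = attended.mean(axis=0)                 # average over tokens
    logit = pooled @ w + b                         # hypothetical linear head
    return float(1.0 / (1.0 + np.exp(-logit)))     # sigmoid


# Toy inputs: random stand-ins for real encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
dim = 8


def fake_image_features():
    scene = rng.normal(size=(6, dim))      # features preserving local image information
    aligned = rng.normal(size=(4, dim))    # features aligned with natural language
    narrative = rng.normal(size=(2, dim))  # features structured through natural language
    return lambda_representation(scene, aligned, narrative)


h_before = fake_image_features()
h_after = fake_image_features()
instruction = rng.normal(size=(5, dim))    # stand-in for instruction token embeddings
w, b = rng.normal(size=dim), 0.0

p = predict_success(h_before, h_after, instruction, w, b)
print(f"P(success) = {p:.3f}")
```

With untrained random weights the printed probability carries no meaning; the sketch only shows how the two image representations and the instruction could flow into a single scalar prediction.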

Paper Structure (26 sections, 3 equations, 14 figures, 6 tables)

Introduction
Related Work
Proposed Method
$\lambda$-Representation
$\lambda$-Representation Encoder
Contrastive $\lambda$-Representation Decoder
Experimental Results
Experimental Setup
Quantitative Results
Qualitative Results
Ablation Study
Conclusions and Limitations
Additional Related Work
Details of Modules
Details of Experimental Setup
...and 11 more sections

Figures (14)

Figure 1: (left) An overview of the novel representation: $\lambda$-Representation, which is an integration of three types of representations. (right) A few examples of our task. The task is to predict success or failure based on an open-vocabulary instruction sentence, and egocentric images taken before and after the manipulation.
Figure 2: Overview of Contrastive $\lambda$-Repformer. Given an instruction sentence and images before and after manipulation, our model outputs the predicted probability that the robot successfully performed the manipulation.
Figure 3: Experimental environment. The left and right images show the state before and after manipulation, respectively. Instruction sentences, such as "place a mug in front of the banana," were created based on the situation before the manipulation. Examples of the egocentric images are shown at the top right of each exocentric image.
Figure 4: Successful cases of Contrastive $\lambda$-Repformer on the SP-RT-1 dataset. Examples (i) and (ii) are true positive cases, and (iii) is a true negative case. In each example, the left and right images show the scene before and after the manipulation, respectively.
Figure 5: Qualitative results of the proposed method in zero-shot transfer experiment. Examples (i) and (ii) are true positive and true negative cases, respectively. In each example, the left and right images show the scene before and after the manipulation, respectively.
...and 9 more figures
