Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations
Miyu Goko, Motonari Kambara, Daichi Saito, Seitaro Otsuki, Komei Sugiura
TL;DR
The paper tackles task success prediction for open-vocabulary manipulation (SPOM): predicting a success probability $P(\hat{y}=1)$ from an instruction and egocentric images taken before and after manipulation. It introduces Contrastive $\lambda$-Repformer, which builds a multi-level visual-language representation $h_{\lambda}$ by fusing Scene, Aligned, and Narrative representations and uses a cross-attention (CrossAttn) based decoder to align image differences with the instruction. The core contributions are the $\lambda$-Representation Encoder, the Contrastive $\lambda$-Representation Decoder, and extensive empirical validation on SP-RT-1 (and on SP-HSR in the physical setting), where the proposed method outperforms representative MLLMs and other baselines, improving accuracy by up to $8.66$ percentage points. The work advances reliable SPOM by enabling fine-grained object and relational understanding in open-vocabulary tasks, with potential impact on robustness and safety in real-world robotic manipulation.
Abstract
In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed characteristics of objects and/or subtle changes in the position of objects. We propose Contrastive $\lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates the following three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on important changes by looking at the differences in the representation between the two images. We evaluate Contrastive $\lambda$-Repformer on a dataset built from the RT-1 dataset, a large-scale standard dataset, and on a physical robot platform. The results show that our approach outperformed existing approaches, including MLLMs. Our best model achieved an improvement of 8.66 points in accuracy compared to the representative MLLM-based model.
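The pipeline described above (fuse three feature types per image, take the difference between the before and after representations, and align that difference with the instruction) can be illustrated with a toy NumPy sketch. This is not the authors' implementation. The function names (`fuse`, `cross_attention`, `predict_success`), the token counts, the feature dimension, the mean pooling, the linear head, and the random stand-in features are all assumptions made for illustration; the actual model uses learned encoders and a learned decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(scene, aligned, narrative):
    # Hypothetical fusion: stack the three feature sets along the token axis.
    return np.concatenate([scene, aligned, narrative], axis=0)  # (tokens, d)

def cross_attention(query, keys_values):
    # Single-head scaled dot-product attention, no learned projections (assumption).
    d = query.shape[-1]
    scores = query @ keys_values.T / np.sqrt(d)    # (q_tokens, kv_tokens)
    return softmax(scores, axis=-1) @ keys_values  # (q_tokens, d)

def predict_success(instr, h_before, h_after, w, b):
    # Instruction tokens attend over the change between the two image representations.
    diff = h_after - h_before                 # (tokens, d)
    attended = cross_attention(instr, diff)   # (instr_tokens, d)
    pooled = attended.mean(axis=0)            # (d,)
    logit = pooled @ w + b                    # hypothetical linear head
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid gives a probability

# Random stand-ins for encoder outputs (assumption: 4 + 2 + 1 tokens, d = 16).
rng = np.random.default_rng(0)
d = 16

def random_image_features():
    return fuse(rng.normal(size=(4, d)),
                rng.normal(size=(2, d)),
                rng.normal(size=(1, d)))

h_before = random_image_features()
h_after = random_image_features()
instr = rng.normal(size=(5, d))   # stand-in instruction token embeddings
w = rng.normal(size=d)
b = 0.0

p = predict_success(instr, h_before, h_after, w, b)
print(p)
```

In this toy version, identical before and after representations give a zero difference, so the prediction falls back to the bias term alone. That reflects the intuition in the abstract: the decision is driven by what changed between the two images, read in light of the instruction.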
