CubeRobot: Grounding Language in Rubik's Cube Manipulation via Vision-Language Model

Feiyang Wang; Xiaomin Yu; Wangyu Wu

TL;DR

This work addresses grounding natural language in embodied Rubik's Cube manipulation by introducing CubeRobot, a vision-language model that combines Embodied-Projector grounding, a dual-loop VisionCoT planning architecture, and a Memory Stream for continual learning and reflection. It introduces the CubeCoT dataset (43 subtasks) to evaluate 3x3 Rubik's Cube solving at low, medium, and high difficulty, and reports that CubeRobot achieves 100% accuracy on low- and medium-level tasks and 80% on high-level tasks, surpassing several fine-tuned baseline models. The key contributions are the dual-loop CoT, which separates high-level planning from low-level execution; the Memory Stream, which supports experience-based memory and reflection; and an end-to-end system evaluated in a simulated environment and through ablation analyses. The results indicate that integrating vision-language grounding with memory and iterative reasoning can significantly improve robotic manipulation in complex, dynamic tasks, with potential applicability to broader embodied AI problems.

Abstract

Proving Rubik's Cube theorems at a high level represents a notable milestone in human-level spatial imagination, logical thinking, and reasoning. Traditional Rubik's Cube robots, which rely on complex vision systems and fixed algorithms, often struggle to adapt to complex and dynamic scenarios. To overcome this limitation, we introduce CubeRobot, a novel vision-language model (VLM) tailored for solving 3x3 Rubik's Cubes, empowering embodied agents with multimodal understanding and execution capabilities. We use the CubeCoT image dataset, which contains multi-level tasks (43 subtasks in total) that humans are unable to handle, encompassing various cube states. We incorporate a dual-loop VisionCoT architecture and Memory Stream, a paradigm for extracting task-related features from VLM-generated planning queries, thus enabling CubeRobot to carry out independent planning, decision-making, and reflection, and to manage high- and low-level Rubik's Cube tasks separately. In low-level Rubik's Cube restoration tasks, CubeRobot achieves 100% accuracy; it also achieves 100% accuracy on medium-level tasks and 80% on high-level tasks.
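To make the dual-loop idea concrete, the following is a minimal sketch of how a high-level planning loop, a low-level execution loop, and a memory of past outcomes could fit together. It is not the authors' implementation: `ToyCube`, `MemoryStream`, `high_level_plan`, `low_level_execute`, and `solve` are hypothetical names, and the toy state only tracks outstanding scramble moves rather than modelling a real 3x3 cube or any vision-language model.

```python
from dataclasses import dataclass, field


def invert(move: str) -> str:
    """Inverse of a face turn in standard notation: R <-> R', R2 -> R2."""
    if move.endswith("2"):
        return move
    return move[:-1] if move.endswith("'") else move + "'"


class ToyCube:
    """Hypothetical stand-in for a cube: a stack of not-yet-undone scramble moves."""

    def __init__(self, scramble):
        self.pending = list(scramble)

    def apply(self, move: str) -> None:
        # A move cancels the top of the stack if it is its inverse;
        # otherwise it is pushed as one more move to undo.
        if self.pending and invert(self.pending[-1]) == move:
            self.pending.pop()
        else:
            self.pending.append(move)

    def solved(self) -> bool:
        return not self.pending


@dataclass
class MemoryStream:
    """Assumed structure: a log of subgoal outcomes plus a simple summary."""

    records: list = field(default_factory=list)

    def add(self, subgoal: str, moves: list, success: bool) -> None:
        self.records.append({"subgoal": subgoal, "moves": moves, "success": success})

    def reflect(self) -> str:
        done = sum(r["success"] for r in self.records)
        return f"{done}/{len(self.records)} subgoals succeeded"


def high_level_plan(cube: ToyCube, chunk: int = 2) -> list:
    """Outer loop step: split the remaining work into short subgoals."""
    solution = [invert(m) for m in reversed(cube.pending)]
    return [solution[i:i + chunk] for i in range(0, len(solution), chunk)]


def low_level_execute(cube: ToyCube, moves: list) -> bool:
    """Inner loop: apply each move and check that it made progress."""
    for move in moves:
        before = len(cube.pending)
        cube.apply(move)
        if len(cube.pending) >= before:
            return False
    return True


def solve(cube: ToyCube, memory: MemoryStream, max_rounds: int = 3) -> bool:
    for _ in range(max_rounds):  # replan if a subgoal failed in the previous round
        if cube.solved():
            break
        for i, moves in enumerate(high_level_plan(cube)):
            ok = low_level_execute(cube, moves)
            memory.add(f"subgoal-{i}", moves, ok)
            if not ok:
                break
    return cube.solved()


cube = ToyCube(["R", "U'", "F2"])
memory = MemoryStream()
result = solve(cube, memory)
summary = memory.reflect()
```

In this sketch the planner only proposes subgoals and the executor only applies and checks individual moves, so a failed move triggers replanning in the outer loop while the memory keeps a record that later rounds could consult.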
