Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance
Kaifeng Zhang, Zhao-Heng Yin, Weirui Ye, Yang Gao
TL;DR
This work tackles the challenge of reward design for robotic manipulation by introducing RoboCoT, a framework that uses a robot chain-of-thought to decompose language instructions into sub-goals and leverages Vision-Language Models (VLMs) to provide fine-grained rewards. A VLM-based self-imitation loop further accelerates learning by reinforcing successful trajectories. Across 10 MetaWorld-v2 tasks, RoboCoT achieves a $5.4\times$ higher average success rate than the strongest baseline (RoboCLIP), with ablations confirming the contribution of each component. The approach reduces sample complexity and turns language guidance into dense, task-relevant feedback, enabling more robust and scalable manipulation skills.
Abstract
Defining reward functions for skill learning has been a long-standing challenge in robotics. Recently, vision-language models (VLMs) have shown promise in defining reward signals for teaching robots manipulation skills. However, existing work often provides reward guidance that is too coarse, leading to inefficient learning. In this paper, we address this issue by implementing more fine-grained reward guidance. We decompose tasks into simpler sub-tasks and use this decomposition to offer more informative reward guidance with VLMs. We also propose a VLM-based self-imitation learning process to speed up learning. Empirical evidence demonstrates that our algorithm consistently outperforms baselines such as CLIP, LIV, and RoboCLIP. Specifically, our algorithm achieves a $5.4\times$ higher average success rate than the best baseline, RoboCLIP, across a series of manipulation tasks.
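To make the idea of sub-task-level reward guidance concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes that a VLM has already produced an embedding for the current camera frame and one text embedding per sub-goal; the function names, the cosine-similarity scoring, and the `threshold` parameter are all hypothetical choices made for this example.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def subgoal_reward(frame_emb, subgoal_embs, stage, threshold=0.9):
    """Return (reward, next_stage) for a single observation.

    frame_emb    -- embedding of the current frame (stand-in for a VLM output)
    subgoal_embs -- ordered list of sub-goal text embeddings (also stand-ins)
    stage        -- index of the sub-goal currently being pursued
    threshold    -- hypothetical similarity needed to mark a sub-goal as done
    """
    n = len(subgoal_embs)
    if stage >= n:
        # Every sub-goal is already complete: return the maximum reward.
        return float(n), stage
    sim = cosine(frame_emb, subgoal_embs[stage])
    if sim >= threshold:
        # Sub-goal reached: advance and grant one full unit per finished stage.
        stage += 1
        return float(stage), stage
    # Otherwise: completed stages plus partial credit for the current one.
    return stage + max(sim, 0.0), stage


# Toy 2-D embeddings for a two-step task (purely illustrative values).
subgoals = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
reward, stage = subgoal_reward(np.array([1.0, 0.0]), subgoals, stage=0)
print(reward, stage)
```

Compared with a single similarity score against the full instruction, a staged signal of this kind tells the agent which part of the task it is currently failing at, which is the intuition behind finer-grained guidance.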
