Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance

Kaifeng Zhang, Zhao-Heng Yin, Weirui Ye, Yang Gao

TL;DR

This work tackles the challenge of reward design for robotic manipulation by introducing RoboCoT, a framework that uses a robot chain-of-thought to decompose language instructions into sub-goals and leverages vision-language models (VLMs) to provide fine-grained rewards. A VLM-based self-imitation loop further accelerates learning by reinforcing successful trajectories. Across 10 MetaWorld-v2 tasks, RoboCoT achieves a $5.4\times$ higher average success rate than the strongest baseline (RoboCLIP), with ablations confirming the contribution of each component. The approach reduces sample complexity and turns language guidance into dense, task-relevant feedback, enabling more robust and scalable manipulation skills.

Abstract

Defining reward functions for skill learning has been a long-standing challenge in robotics. Recently, vision-language models (VLMs) have shown promise in defining reward signals for teaching robots manipulation skills. However, existing work often provides reward guidance that is too coarse, leading to inefficient learning. In this paper, we address this issue by implementing more fine-grained reward guidance. We decompose tasks into simpler sub-tasks and use this decomposition to offer more informative reward guidance with VLMs. We also propose a VLM-based self-imitation learning process to speed up learning. Empirical evidence demonstrates that our algorithm consistently outperforms baselines such as CLIP, LIV, and RoboCLIP. Specifically, our algorithm achieves a $5.4\times$ higher average success rate than the best baseline, RoboCLIP, across a series of manipulation tasks.

Paper Structure (13 sections, 4 equations, 6 figures)

Figures (6)

  • Figure 1: First, the robot receives the language instruction, e.g., "open the door." Subsequently, our algorithm interprets this instruction using robot chain-of-thought processing, breaking it down into several detailed prompts. Then, the robot initiates reinforcement learning with our designed reward signal, taking the failure guidance into account while learning the skills. Successful experiences are recorded by foundation models to reinforce effective behaviors, i.e., VLM-based self-imitation.
  • Figure 2: First, text embeddings are generated with robot chain-of-thought (positive prompts) and sparse failure guidance (negative prompts). Next, a moving window slides across the agent's rollout trajectory, calculating the NCE reward based on both video and text embeddings. Finally, VLMs evaluate the final observation. If this observation is deemed successful, the trajectory is recorded in the replay buffer for the self-imitation procedure.
  • Figure 3: Training curves for the baselines and our algorithm over 3 random seeds. Each data point is evaluated with 20 sampled trajectories. The shaded area shows one standard deviation. Our algorithm outperforms all baselines, achieving a $5.4\times$ average improvement over RoboCLIP.
  • Figure 4: Final policy performance for the ablations and the full RoboCoT. The results show that each component of RoboCoT contributes to the final performance.
  • Figure 5: Ablation study of our algorithm with different methods for collecting successful experience (GPT-4V or ground-truth (GT) labels). The results demonstrate that GPT-4V is robust in collecting successful experience on a subset of robotic tasks.
  • ...and 1 more figures
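
The reward pipeline described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are plain Python lists standing in for VLM outputs, each window is summarized by mean pooling, and the reward takes an InfoNCE-style log-ratio form over positive (chain-of-thought) and negative (failure) prompts. The function names, `temperature`, `window`, `stride`, and the `is_success` callback are all hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mean_pool(frames: Sequence[Vector]) -> List[float]:
    """Assumed stand-in for a video encoder: average the frame embeddings."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def nce_reward(clip: Vector, positives: Sequence[Vector],
               negatives: Sequence[Vector], temperature: float = 0.1) -> float:
    """InfoNCE-style reward (assumed form): log-share of positive-prompt similarity."""
    pos = sum(math.exp(cosine(clip, p) / temperature) for p in positives)
    neg = sum(math.exp(cosine(clip, n) / temperature) for n in negatives)
    return math.log(pos / (pos + neg))


def windowed_rewards(frame_embs: Sequence[Vector], positives: Sequence[Vector],
                     negatives: Sequence[Vector], window: int = 4,
                     stride: int = 1) -> List[float]:
    """Slide a fixed-size window over the rollout and score each window."""
    rewards = []
    for start in range(0, len(frame_embs) - window + 1, stride):
        clip = mean_pool(frame_embs[start:start + window])
        rewards.append(nce_reward(clip, positives, negatives))
    return rewards


def maybe_store(trajectory: Sequence[Vector],
                is_success: Callable[[Vector], bool],
                buffer: list) -> bool:
    """Keep the trajectory for self-imitation only if its final observation passes."""
    if is_success(trajectory[-1]):
        buffer.append(trajectory)
        return True
    return False


# Toy rollout: the first half matches the positive prompt, the second half the negative one.
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
rewards = windowed_rewards(frames, positives=[[1.0, 0.0]], negatives=[[0.0, 1.0]])
print(rewards)
```

In this toy rollout the reward is near zero while the window matches the positive prompt and drops as the window drifts toward the negative prompt. In a real system, `is_success` would wrap a VLM query on the final observation, and the embeddings would come from a video-text model rather than hand-written vectors.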