Language-Conditioned Robotic Manipulation with Fast and Slow Thinking
Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Zhengping Che, Chaomin Shen, Yaxin Peng, Dong Liu, Feifei Feng, Jian Tang
TL;DR
RFST (Robotics with Fast and Slow Thinking) addresses the challenge of translating language into robot actions with a dual-system architecture inspired by human cognition. It combines a fast-thinking policy for simple instructions with a slow-thinking vision-language reasoning module, guided by an instruction discriminator and Think Bank. The slow-thinking module performs reasoning and intent recognition with a fine-tuned VLM (ViT-L/14 CLIP backbone with LLaMA-2-7B) and uses CLIP-based grounding to produce subgoals for the policy network. The authors show improved performance on complex tasks in simulation (VIMA-Bench) and in real-world experiments with a Franka arm, and provide a real-world trajectory dataset for fast and slow thinking tasks. The framework offers a practical path toward language-conditioned manipulation that can handle both straightforward and reasoning-intensive tasks with reduced data requirements.
Abstract
Language-conditioned robotic manipulation aims to translate natural language instructions into executable actions, from simple pick-and-place to tasks requiring intent recognition and visual reasoning. Inspired by the dual process theory in cognitive science, which suggests two parallel systems of fast and slow thinking in human decision-making, we introduce Robotics with Fast and Slow Thinking (RFST), a framework that mimics human cognitive architecture to classify tasks and make decisions with one of two systems based on the instruction type. RFST consists of two key components: 1) an instruction discriminator that determines which system should be activated based on the current user instruction, and 2) a slow-thinking system composed of a fine-tuned vision-language model aligned with the policy networks, which allows the robot to recognize user intention or perform reasoning tasks. To assess our methodology, we built a dataset of real-world trajectories, capturing actions ranging from spontaneous impulses to tasks requiring deliberate contemplation. Our results, in both simulation and real-world scenarios, confirm that our approach handles intricate tasks that demand intent recognition and reasoning. The project is available at https://jlm-z.github.io/RSFT/
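
The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Step`, `route`, `toy_discriminator`, and `toy_decomposer` are hypothetical, the keyword-based discriminator is only a stand-in for a learned classifier, and the fixed subgoal list is a stand-in for the fine-tuned vision-language model. The sketch only shows the control flow in which a discriminator decides whether an instruction goes straight to the policy (fast) or is first turned into subgoals (slow).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    system: str   # "fast" or "slow": which system produced this step
    subgoal: str  # simple instruction handed to the policy network


def route(
    instruction: str,
    is_complex: Callable[[str], bool],
    decompose: Callable[[str], List[str]],
) -> List[Step]:
    """Dispatch an instruction to the fast or the slow system."""
    if not is_complex(instruction):
        # Fast path: the instruction is passed to the policy unchanged.
        return [Step("fast", instruction)]
    # Slow path: a reasoning module first turns it into simpler subgoals.
    return [Step("slow", goal) for goal in decompose(instruction)]


# Hypothetical stand-in for a learned instruction discriminator.
def toy_discriminator(instruction: str) -> bool:
    cues = ("thirsty", "same", "if ", "then")
    text = instruction.lower()
    return any(cue in text for cue in cues)


# Hypothetical stand-in for the slow-thinking reasoning module.
def toy_decomposer(instruction: str) -> List[str]:
    return ["pick up the cup", "place the cup near the user"]


if __name__ == "__main__":
    for text in ("pick up the red block", "I am thirsty"):
        for step in route(text, toy_discriminator, toy_decomposer):
            print(f"{text!r} -> [{step.system}] {step.subgoal}")
```

Keeping the discriminator and the decomposer as plain callables means either one can be swapped for a trained model without touching the dispatch logic, which is the point of the sketch.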
