Language-Conditioned Robotic Manipulation with Fast and Slow Thinking

Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Zhengping Che, Chaomin Shen, Yaxin Peng, Dong Liu, Feifei Feng, Jian Tang

TL;DR

RFST addresses the challenge of translating language into robot actions by using a dual-system architecture inspired by human cognition. It combines a fast-thinking policy for simple instructions with a slow-thinking vision-language reasoning module guided by an instruction discriminator and Think Bank. The slow-thinking module grounds reasoning and intent recognition using a fine-tuned VLM (ViT-L/14 CLIP backbone with LLaMA-2-7B) and CLIP-based grounding to produce subgoals for the policy network. The authors show improved performance on complex tasks in simulation (VIMA-Bench) and real-world experiments with a Franka arm, and provide a real-world trajectory dataset for fast and slow thinking tasks. This framework offers a practical path toward language-conditioned manipulation that can handle both straightforward and reasoning-intensive tasks with reduced data requirements.

Abstract

Language-conditioned robotic manipulation aims to translate natural language instructions into executable actions, from simple pick-and-place to tasks requiring intent recognition and visual reasoning. Inspired by the dual process theory in cognitive science, which suggests two parallel systems of fast and slow thinking in human decision-making, we introduce Robotics with Fast and Slow Thinking (RFST), a framework that mimics human cognitive architecture to classify tasks and make decisions on two systems based on instruction types. RFST consists of two key components: 1) an instruction discriminator that determines which system should be activated based on the current user instruction, and 2) a slow-thinking system composed of a fine-tuned vision-language model aligned with the policy networks, which allows the robot to recognize user intention or perform reasoning tasks. To assess our methodology, we built a dataset of real-world trajectories, capturing actions ranging from spontaneous impulses to tasks requiring deliberate contemplation. Our results, in both simulation and real-world scenarios, confirm that our approach handles intricate tasks that demand intent recognition and reasoning. The project is available at https://jlm-z.github.io/RSFT/
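
The instruction discriminator can be pictured as nearest-neighbor routing over a bank of labeled instructions. The following is a minimal sketch under stated assumptions, not the authors' implementation: the bank entries are invented for illustration, `embed` is a toy bag-of-words stand-in for the Distil-RoBERTa embedding mentioned in Figure 1, and the names `INSTRUCTION_BANK`, `embed`, `cosine`, and `route` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical annotated bank of (instruction, system label) pairs.
# The paper reports GPT-4 annotation; these entries are invented examples.
INSTRUCTION_BANK = [
    ("put the red block into the bowl", "fast"),
    ("pick up the green cup", "fast"),
    ("i am thirsty, bring me something to drink", "slow"),
    ("arrange the blocks so they match the picture", "slow"),
]


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a learned sentence embedding."""
    return Counter(text.lower().replace(",", " ").split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(instruction, bank=INSTRUCTION_BANK):
    """Return the label of the bank entry most similar to the instruction."""
    query = embed(instruction)
    _, label = max(bank, key=lambda item: cosine(query, embed(item[0])))
    return label


print(route("pick up the red cup"))
print(route("i am thirsty"))
```

Swapping `embed` for a real sentence encoder would leave `route` unchanged, since the routing only depends on a similarity score between the query and each bank entry.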

Paper Structure (11 sections, 6 figures, 1 table)

This paper contains 11 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: The overview of RFST. We collected a number of instructions and employed GPT-4 for annotation. Upon receiving an instruction, the robot processes it through Distil-RoBERTa to obtain an embedding. Using embedding similarity search, we classify the instruction as belonging to either the fast-thinking system or the slow-thinking system.
  • Figure 2: An illustrative example of step-by-step task planning generated by GPT-3.5-turbo. The planning produced by the LLM serves as the foundation for formulating the text-image pairs used for VLM training.
  • Figure 3: An illustration of CLIP computing the similarity between step-wise text description and observations.
  • Figure 4: We collect a dataset with real-world trajectories using a Franka robotic arm. Each trajectory is a sequence of images from two cameras. We consider multiple tasks that belong to either the fast-thinking system or the slow-thinking system.
  • Figure 5: Examples of tasks in simulation. We select six tasks from VIMA-Bench (Jiang et al., 2022) and categorize them into fast-thinking and slow-thinking tasks accordingly.
  • ...and 1 more figure
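
The similarity computation described in the Figure 3 caption can be sketched as a cosine-similarity match between one observation embedding and several step-description embeddings. This is an illustrative sketch only: the vectors are hand-written placeholders for CLIP image and text embeddings, the function names are hypothetical, the temperature value is an assumption, and how RFST consumes the resulting scores is not specified in the text above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_step(observation_emb, step_embs, temperature=0.07):
    """Score every step description against one observation.

    Returns the index of the best-matching step, the raw cosine
    similarities, and a softmax distribution over the steps.
    The temperature is an assumed value, not one taken from the paper.
    """
    sims = [cosine(observation_emb, emb) for emb in step_embs]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims, probs


# Placeholder embeddings; real ones would come from a CLIP text and image encoder.
step_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
observation_emb = [0.1, 0.9, 0.0]

best, sims, probs = match_step(observation_emb, step_embs)
print(best, [round(s, 3) for s in sims])
```

Dividing by a small temperature before the softmax sharpens the distribution, so small differences in cosine similarity translate into a clear preference for one step.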