What Foundation Models can Bring for Robot Learning in Manipulation : A Survey

Dingzhe Li; Yixiang Jin; Yuhao Sun; Yong A; Hongze Yu; Jun Shi; Xiaoshuai Hao; Peng Hao; Huaping Liu; Xiang Li; Xinde Li; Fuchun Sun; Jianwei Zhang; Bin Fang

TL;DR

This survey identifies a path to general manipulation by integrating foundation models across modular robot-learning components. It presents a comprehensive framework with modules for interaction, pre/post-condition detection, skill hierarchy, state perception, policy and transition learning, and data generation, and analyzes how LLMs, VFMs, VLMs, LMMs, VGMs, and RFMs can address challenges in each. Key contributions include mapping RFMs to specific manipulation challenges, proposing a framework for general manipulation, and outlining data-generation, simulation-to-real transfer, and benchmarking considerations. The findings underscore both the potential of RFMs to enhance perception, planning, and learning efficiency, and the practical hurdles of safety, scalability, and cross-embodiment generalization in real-world deployment.

Abstract

The realization of universal robots is an ultimate goal of researchers. However, a key hurdle in achieving this goal lies in the robots' ability to manipulate objects in unstructured environments according to different tasks. Learning-based approaches are considered an effective way to address generalization. The impressive performance of foundation models in computer vision and natural language processing suggests that embedding foundation models into manipulation tasks is a viable path toward general manipulation capability. However, we believe achieving general manipulation capability requires an overarching framework akin to that of autonomous driving. This framework should encompass multiple functional modules, with different foundation models assuming distinct roles in facilitating general manipulation capability. This survey focuses on the contributions of foundation models to robot learning for manipulation. We propose a comprehensive framework and detail how foundation models can address challenges in each module of the framework. Furthermore, we examine current approaches, outline challenges, suggest future research directions, and identify potential risks associated with integrating foundation models into this domain.

Paper Structure (43 sections, 10 figures, 6 tables)

This paper contains 43 sections, 10 figures, 6 tables.

Introduction
Framework of Robot Learning for General Manipulation
Human/Agent Interaction
Pre- and Post-conditions Detection
Object Affordance
Object Recognition
Hierarchy of skills
State
3D Reconstruction
Pose Estimation
Policy
VLAC
VLAKP
VLADP
Foundation Models assisting for Reinforcement Learning
...and 28 more sections

Figures (10)

Figure 1: LLMs help address challenges in Interaction, Manipulation Data Generation, Hierarchy of Skills, Skill Policy Learning, and Environment Transition Model. VLMs assist in tackling challenges in Interaction, Manipulation Data Generation, Hierarchy of Skills, Pre- and Post-conditions Detection, Skill Policy Learning, and Perception. LMMs aid in addressing challenges in Interaction and Perception. VGMs tackle the challenge of Manipulation Data Generation and Environment Transition. VFMs help address challenges in Manipulation Data Generation, Hierarchy of Skills, Pre- and Post-conditions Detection, Skill Policy Learning, and Perception. RFMs assist in addressing the challenge of Skill Policy Learning.
Figure 2: Framework of Robot Learning for General Manipulation. The Pre-conditions Detection module $P$ perceives the environment to identify objects and the affordances those objects support. The Interaction module $I$ receives an instruction from a human or another agent. It uses perception information from the Pre-conditions Detection module $P$ to check the instruction for ambiguities. If any are found, it generates a question asking the human or other agent to clarify the instruction. The Hierarchy of Skills module $H$ generates subgoals from the precise instruction provided by the Interaction module $I$ and the perception information from the Pre-conditions Detection module $P$. Each subgoal is then passed to the Skill Execution module. In the Skill Execution module, the Policy module $\Pi$ generates Action $\alpha$ based on the State $S$. To obtain the next state after executing the current action, State $S$ can either perceive it from the environment or use the Transition module $T$. Training the Skill Execution module, which includes the State module $S$, the Policy module $\Pi$, and the Transition module $T$, requires the Manipulation Data Generation module, which provides a task-level manipulation dataset. When issues arise during execution, a corrective instruction is sent to the Policy module $\Pi$ for manual adjustment. The Policy module $\Pi$ changes the current action to a corrective action and saves the corrective demonstration to the dataset for its own self-improvement. After skill execution, the Post-conditions Detection module $P$ determines whether execution succeeded. If it did, the system proceeds to the next subgoal; if not, the failure reason is conveyed to the Post-hoc Correction module for self-correction.
Figure 3: Foundation Models for Interaction Module. Interaction mainly involves the exchange of task instructions and corrective instructions. Ambiguity often arises in task instructions, so the robot needs to detect it. 1) One approach is to perceive objects in a multi-modal environment and enumerate possible ambiguities based on the perception information (mo2023towards). 2) Another approach uses an LLM as a next-step prediction module that predicts and scores the next step; if the scores of the top 2 steps are less than $\delta$, the task goal is considered ambiguous (ren2023robots). 3) Conveying corrective instructions requires strong comprehension skills, and the current mainstream approach uses the encoder of an LLM to extract tokens and feed them into the policy to modify the original trajectory (bucker2023latte).
Figure 4: Foundation Models for Pre-conditions Detection. For object affordance, the main approaches to task-oriented grasping are supervised learning and reinforcement learning. Both use an LLM to generate an object part-level description and the desired affordance description from the task instruction, then fuse tokens and features into the original network through a language encoder and an image encoder to output a task-oriented grasp pose (tang2023graspgpt; ren2023leveraging). In reinforcement learning, one can choose between an LLM language encoder with a custom-designed image encoder, or a VLM language encoder with a VLM image encoder. With the former, the LLM language encoder should be frozen and the custom image encoder trained (ren2023leveraging). With the latter, both encoders are typically frozen (xu2023joint). The approach that uses foundation models directly has an LLM generate the object part-level description and the desired affordance description according to the task instruction; a VLM then marks the part of the object to grasp in the image based on that description (liu2023partslip). For object recognition, the representation learning methods in state perception mainly include contrastive learning (radford2021learning), distillation-based learning (caron2021emerging), and masked autoencoder learning (radosavovic2023real). Masked autoencoding methods prioritize low-level spatial aspects at the expense of high-level semantics, whereas contrastive learning methods do the inverse; Voltron (karamcheti2023language) and iBOT (zhou2021ibot) both fuse masked autoencoding with contrastive learning. Multimodal representation learning focuses primarily on multimodal alignment (xue2023ulip2; tatiya2023mosaic). Training the encoder with large-scale data and parameters has facilitated open-set perception, including tasks such as open-set detection and open-set segmentation. For instance, SAM (kirillov2023segment) uses MAE (he2022masked), and ViLD (gu2021open) employs CLIP (radford2021learning).
Figure 5: Foundation Models for Hierarchy of Skills. 1) Human operation videos are used to learn the skill sequence for task execution: the video of the user's progress so far is segmented into observations and human actions, which are input along with the task instruction into a pre-trained language model to predict the next step (patel2023pretrained). 2) An LLM scores the skills in the skill library based on the task instruction and the skills already executed, while a value function scores the same skills based on observation images. The two scores are multiplied, and the highest-scoring skill is selected as the next step (ahn2022can). The value function can consider multiple factors such as affordance, safety, and user preference (huang2023grounded), and these considerations can also be incorporated by fine-tuning the LLM (wu2023tidybot). 3) An LLM assists a classical planner by translating the task instruction into a PDDL description, sending it to the classical planner to generate a PDDL plan, and then translating the PDDL plan back into a natural language plan (liu2023llm).
...and 5 more figures
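
The skill-selection rule in item 2 of the Figure 5 caption (multiply the LLM score by the value-function score and take the highest product) can be sketched roughly as follows. This is only an illustrative sketch, not the implementation from ahn2022can: the function name `select_next_skill`, the skill names, and all numeric scores are hypothetical, and in a real system the scores would come from an LLM and a learned value function rather than hard-coded dictionaries.

```python
from typing import Dict


def select_next_skill(llm_scores: Dict[str, float],
                      value_scores: Dict[str, float]) -> str:
    """Return the skill with the highest (LLM score * value score).

    llm_scores:   hypothetical task-relevance score per skill, as an LLM
                  might assign given the instruction and executed skills.
    value_scores: hypothetical feasibility score per skill, as a value
                  function might assign given the current observation.
    """
    # Only skills scored by both sources can be compared.
    shared = llm_scores.keys() & value_scores.keys()
    if not shared:
        raise ValueError("no skill has both an LLM score and a value score")
    # Sorting first makes tie-breaking deterministic (alphabetical order).
    return max(sorted(shared), key=lambda s: llm_scores[s] * value_scores[s])


# Made-up scores for illustration only.
llm_scores = {"pick up sponge": 0.6, "go to sink": 0.3, "open drawer": 0.1}
value_scores = {"pick up sponge": 0.2, "go to sink": 0.9, "open drawer": 0.8}

next_skill = select_next_skill(llm_scores, value_scores)
print(next_skill)
```

In this made-up example the LLM prefers "pick up sponge", but the value function rates it as hard to execute from the current observation, so the product favors "go to sink". This illustrates why the caption combines the two scores instead of relying on the LLM alone.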
