Real-World Robot Applications of Foundation Models: A Review

Kento Kawaharazuka; Tatsuya Matsushima; Andrew Gambardella; Jiaxian Guo; Chris Paxton; Andy Zeng

TL;DR

An overview of the practical application of foundation models in real-world robotics, with a primary emphasis on the replacement of specific components within existing robot systems.

Abstract

Recent developments in foundation models, like Large Language Models (LLMs) and Vision-Language Models (VLMs), trained on extensive data, facilitate flexible application across different tasks and modalities. Their impact spans various fields, including healthcare, education, and robotics. This paper provides an overview of the practical application of foundation models in real-world robotics, with a primary emphasis on the replacement of specific components within existing robot systems. The summary encompasses the perspective of input-output relationships in foundation models, as well as their role in perception, motion planning, and control within the field of robotics. This paper concludes with a discussion of future challenges and implications for practical robot applications.

Paper Structure (37 sections, 5 figures, 2 tables)

This paper contains 37 sections, 5 figures, 2 tables.

Introduction
Foundation Models
Foundation Models for Language
Foundation Models for Vision
Foundation Models for Vision and Language
Foundation Models for Audio
Foundation Models for 3D Representation
Foundation Models for Other Modalities
Applications of Foundation Models to Robotics
Low-level Perception
Low-level Perception for Feature Extraction
Low-level Perception for Scene Recognition
High-level Perception
High-level Perception for Objective Design
High-level Perception for Map Construction
...and 22 more sections

Figures (5)

Figure 1: The structure of this study. In \ref{sec:foundation-models}, we overview the characteristics of foundation models and introduce common downstream tasks. In \ref{sec:fm_application_robotics}, we categorize studies of applications of foundation models in robotics. In \ref{sec:fm_for_robotics}, we introduce prior work on creating foundation models for robotics, so-called robotic foundation models. In \ref{sec:robot_task_environment}, we overview robots, tasks, and environments used for applications of foundation models in robotics.
Figure 2: An overview of foundation models, classified by modality (language, vision, audio, and 3D representation) and by network input and output.
Figure 3: The overview of utilization of foundation models for robots. With foundation models, low-level perception conducts feature extraction or scene recognition, high-level perception conducts reward generation or map construction, high-level planning conducts task planning or code generation, low-level planning conducts footstep generation or command generation, and data augmentation conducts image augmentation or instruction augmentation.
Figure 4: The four types of combinations of low-level perception, high-level perception, high-level planning, low-level planning, and data augmentation with foundation models.
Figure 5: The overview of robots, tasks, and environments used for research with foundation models.
