Foundation Models in Robotics: Applications, Challenges, and the Future

Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, Brian Ichter, Danny Driess, Jiajun Wu, Cewu Lu, Mac Schwager

TL;DR

The paper analyzes how foundation models (LLMs, VLMs, embodied multimodal models, and diffusion-based generators) can transform robot perception, decision-making, and control. It surveys methods that fuse language, vision, and action, including open-vocabulary grounding, task planning, and policy learning, and catalogs advances in robot transformers, open-vocabulary navigation and manipulation, and embodied AI benchmarks. It also discusses critical challenges (data scarcity, safety, uncertainty quantification, real-time constraints, and reproducibility) and offers directions such as data-efficient training, synthetic data generation, and calibrated runtime monitoring. Overall, foundation models could enable generalist, context-aware robotic systems, but reliable, safe, real-time operation will require targeted research across data, algorithms, evaluation, and deployment strategies.

Abstract

We survey applications of pretrained foundation models in robotics. Traditional deep learning models in robotics are trained on small datasets tailored for specific tasks, which limits their adaptability across diverse applications. In contrast, foundation models pretrained on internet-scale data appear to have superior generalization capabilities, and in some instances display an emergent ability to find zero-shot solutions to problems that are not present in the training data. Foundation models may hold the potential to enhance various components of the robot autonomy stack, from perception to decision-making and control. For example, large language models can generate code or provide common sense reasoning, while vision-language models enable open-vocabulary visual recognition. However, significant open research challenges remain, particularly around the scarcity of robot-relevant training data, safety guarantees and uncertainty quantification, and real-time execution. In this survey, we study recent papers that have used or built foundation models to solve robotics problems. We explore how foundation models contribute to improving robot capabilities in the domains of perception, decision-making, and control. We discuss the challenges hindering the adoption of foundation models in robot autonomy and outline opportunities and potential pathways for future advancements. The GitHub project corresponding to this paper is a preliminary release, which we are committed to further enhancing and updating to ensure its quality and relevance; it can be found here: https://github.com/robotics-survey/Awesome-Robotics-Foundation-Models

Paper Structure (58 sections, 7 equations, 1 figure, 2 tables)