Are Foundation Models the Route to Full-Stack Transfer in Robotics?

Freek Stulp, Samuel Bustamante, João Silvério, Alin Albu-Schäffer, Jeannette Bohg, Shuran Song

TL;DR

An overview of the impact that foundation models and transformer networks have had on different levels of abstraction, bringing robots closer than ever to "full-stack transfer" in robotics.

Abstract

In humans and robots alike, transfer learning occurs at different levels of abstraction, from high-level linguistic transfer to low-level transfer of motor skills. In this article, we provide an overview of the impact that foundation models and transformer networks have had on these different levels, bringing robots closer than ever to "full-stack transfer". Considering LLMs, VLMs and VLAs from a robotic transfer learning perspective allows us to highlight recurring concepts for transfer, beyond specific implementations. We also consider the challenges of data collection and transfer benchmarks for robotics in the age of foundation models. Are foundation models the route to full-stack transfer in robotics? Our expectation is that they will certainly stay on this route as a key technology.

Paper Structure (17 sections, 3 figures, 1 table)

This paper contains 17 sections, 3 figures, 1 table.

Figures (3)

  • Figure 2: Transformer networks have had a profound impact on the performance of LLMs/VLMs (\ref{sec:llmvlm}) and the ability to learn visuomotor policies, e.g. diffusion policies, from demonstrations (\ref{sec:visuomotor}). VLAs (\ref{sec:vlas}) arose as extensions of VLMs. Denoising-based VLAs (\ref{sec:gradient_stopping}) arose by merging diffusion policies or flow matching with VLMs, which was made possible by the shared modality of vision.
  • Figure 3: Exemplary implementations for three distinct classes of VLAs, according to the classification in \cite{kawaharazuka2025vla}: $\bullet$ OpenVLA \cite{kim2024openvla}, see \ref{sec:vla_dat}; $\bullet$ $\pi_{0}$-FAST \cite{pertsch2025fast}, see \ref{sec:action_compression}; $\bullet$ $\pi_{0}$ \cite{black2024pi0}, see \ref{sec:gradient_stopping}. The order follows the structure of the article, and is not chronological.
  • Figure 4: Overview of knowledge insulation implementations in $\pi_{0.5}$ \cite{black2025pi05}, $\pi_{0.5}$-KI \cite{driess2025knowledgeinsulating}, and $\pi_{0.6}$ \cite{amin2025pi06vlalearnsexperience}, which are implementations of denoising-based VLAs. Whereas \ref{fig:action_compression} showed $\pi_{0}$ and $\pi_{0}$-FAST and focussed on inference, this figure instead depicts the different training phases.