Are Foundation Models the Route to Full-Stack Transfer in Robotics?

Freek Stulp; Samuel Bustamante; João Silvério; Alin Albu-Schäffer; Jeannette Bohg; Shuran Song

TL;DR

An overview of the impact that foundation models and transformer networks have had on different levels of abstraction, bringing robots closer than ever to "full-stack transfer" in robotics.

Abstract

In humans and robots alike, transfer learning occurs at different levels of abstraction, from high-level linguistic transfer to low-level transfer of motor skills. In this article, we provide an overview of the impact that foundation models and transformer networks have had on these different levels, bringing robots closer than ever to "full-stack transfer". Considering LLMs, VLMs and VLAs from a robotic transfer learning perspective allows us to highlight recurring concepts for transfer, beyond specific implementations. We also consider the challenges of data collection and transfer benchmarks for robotics in the age of foundation models. Are foundation models the route to full-stack transfer in robotics? Our expectation is that they will certainly stay on this route as a key technology.

Paper Structure (17 sections, 3 figures, 1 table)

This paper contains 17 sections, 3 figures, 1 table.

Introduction
Transfer in Higher Layers: LLMs and VLMs
Transfer in Lower Layers: Visuomotor policies
Types of Transfer in Visuomotor Policies
Towards Full-stack Transfer: Vision Language Action Models
VLAs with Discrete Action Tokens
Action Compression with Tokenizers
Knowledge Insulation through Gradient Stopping
Knowledge Insulation through In-painting
Approaches for Cross-embodiment Transfer in VLAs
Collecting Action Data
Benchmarking Transfer in the Age of Foundation Models
Towards Best Practices for Benchmarking Transfer
Define explicit transfer splits and use transfer-oriented metrics.
Support modular and compositional evaluation.
...and 2 more sections

Figures (3)

Figure 2: Transformer networks have had a profound impact on the performance of LLMs/VLMs (\ref{sec:llmvlm}) and on the ability to learn visuomotor policies, e.g. diffusion policies, from demonstrations (\ref{sec:visuomotor}). VLAs (\ref{sec:vlas}) arose as extensions of VLMs. Denoising-based VLAs (\ref{sec:gradient_stopping}) arose by merging diffusion policies or flow matching with VLMs, which was made possible by the shared modality of vision.
Figure 3: Exemplary implementations for three distinct classes of VLAs, according to the classification in \cite{kawaharazuka2025vla}: $\bullet$ OpenVLA \cite{kim2024openvla}, see \ref{sec:vla_dat}; $\bullet$ $\pi_{0}$-FAST \cite{pertsch2025fast}, see \ref{sec:action_compression}; $\bullet$ $\pi_{0}$ \cite{black2024pi0}, see \ref{sec:gradient_stopping}. The order follows the structure of the article and is not chronological.
Figure 4: Overview of knowledge insulation implementations in $\pi_{0.5}$ \cite{black2025pi05}, $\pi_{0.5}$-KI \cite{driess2025knowledgeinsulating}, and $\pi_{0.6}$ \cite{amin2025pi06vlalearnsexperience}, which are implementations of denoising-based VLAs. Whereas \ref{fig:action_compression} showed $\pi_{0}$ and $\pi_{0}$-FAST and focussed on inference, this figure depicts different training phases.
