Fast Adaptation with Behavioral Foundation Models

Harshit Sikchi, Andrea Tirinzoni, Ahmed Touati, Yingchen Xu, Anssi Kanervisto, Scott Niekum, Amy Zhang, Alessandro Lazaric, Matteo Pirotta

TL;DR

The paper addresses suboptimal zero-shot policies from Behavioral Foundation Models by introducing fast, latent-space online adaptation. It presents two strategies, ReLA (off-policy residual critic) and LoLA (on-policy lookahead), that search in the pre-trained latent space to improve task performance within a small number of environment interactions, while avoiding performance drops. Empirical results across four BFMs and multiple DM Control and Humanoid tasks show 10–40% improvements over zero-shot within tens of episodes, with LoLA achieving monotonic gains and high efficiency. The work demonstrates that BFMs contain superior policies beyond those identified by zero-shot inference and provides practical, scalable methods for rapid adaptation in complex, reward-driven tasks.

Abstract

Unsupervised zero-shot reinforcement learning (RL) has emerged as a powerful paradigm for pretraining behavioral foundation models (BFMs), enabling agents to solve a wide range of downstream tasks specified via reward functions in a zero-shot fashion, i.e., without additional test-time learning or planning. This is achieved by learning self-supervised task embeddings alongside corresponding near-optimal behaviors and incorporating an inference procedure to directly retrieve the latent task embedding and associated policy for any given reward function. Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process, the embedding, and the inference procedure. In this paper, we focus on devising fast adaptation strategies to improve the zero-shot performance of BFMs in a few steps of online interaction with the environment while avoiding any performance drop during the adaptation process. Notably, we demonstrate that existing BFMs learn a set of skills containing more performant policies than those identified by their inference procedure, making them well-suited for fast adaptation. Motivated by this observation, we propose both actor-critic and actor-only fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies on any downstream task. Notably, our approach mitigates the initial "unlearning" phase commonly observed when fine-tuning pre-trained RL models. We evaluate our fast adaptation strategies on top of four state-of-the-art zero-shot RL methods in multiple navigation and locomotion domains. Our results show that they achieve 10-40% improvement over their zero-shot performance in a few tens of episodes, outperforming existing baselines.
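To make the idea of searching in the task-embedding space concrete, here is a minimal toy sketch in Python. It is not ReLA or LoLA: it is a plain greedy random search, swapped in only to illustrate starting from the zero-shot latent $z_r$ and accepting a new latent only when it improves the measured return. The names `project`, `toy_return`, and `latent_search`, the quadratic return, and the rescaling of latents to norm $\sqrt{d}$ are all assumptions made for illustration.

```python
import numpy as np

def project(z):
    """Rescale z to norm sqrt(d); an assumed normalization of the latent space."""
    return np.sqrt(z.shape[0]) * z / np.linalg.norm(z)

def toy_return(z, z_star):
    """Hypothetical stand-in for the episodic return of pi_z (higher is better)."""
    return -float(np.sum((z - z_star) ** 2))

def latent_search(z_init, evaluate, n_iters=200, sigma=0.1, seed=0):
    """Greedy random search around z_init; a candidate is kept only if it improves."""
    rng = np.random.default_rng(seed)
    z_best, j_best = z_init, evaluate(z_init)
    for _ in range(n_iters):
        cand = project(z_best + sigma * rng.standard_normal(z_best.shape))
        j_cand = evaluate(cand)
        if j_cand > j_best:  # never move to a latent with a lower measured return
            z_best, j_best = cand, j_cand
    return z_best, j_best

d = 8
rng = np.random.default_rng(1)
z_star = project(rng.standard_normal(d))                # unknown best latent (toy)
z_r = project(z_star + 0.5 * rng.standard_normal(d))    # imperfect zero-shot latent
j_zero_shot = toy_return(z_r, z_star)
z_adapted, j_adapted = latent_search(z_r, lambda z: toy_return(z, z_star))
print(j_zero_shot, j_adapted)
```

In this deterministic toy the accepted return can never fall below the zero-shot return. With noisy episodic returns from a real environment, that guarantee no longer holds for such a naive search.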

Paper Structure

This paper contains 27 sections, 2 theorems, 15 equations, 12 figures, 4 tables, 2 algorithms.

Key Result

proposition 1

Let $\phi: S \rightarrow \mathbb{R}^d$ be a state feature map and $\{\psi_z\}_{z \in Z}$ the corresponding universal successor features for the policy family $\{\pi_z\}_{z \in Z}$, i.e., $\psi_z(s, a) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t \phi(s_{t+1}) \mid (s, a), \pi_z\right]$. Then, for any reward function ...
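The statement above is cut off, so the following sketch does not attempt to reconstruct it. It only illustrates the definition of successor features on a hypothetical three-state deterministic MDP, together with the standard identity that for a reward linear in the features, $r(s) = \phi(s)^\top w$, the action-value of $\pi_z$ equals $\psi_z(s, a)^\top w$. The transition table, policy, features, and weights are made up for illustration.

```python
import numpy as np

gamma = 0.9
# next_state[s, a]: deterministic toy transitions (3 states, 2 actions)
next_state = np.array([[1, 2], [2, 0], [0, 1]])
policy = np.array([0, 1, 0])                            # action taken by pi_z in each state
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # phi(s) in R^2
n_states, n_actions = next_state.shape

# P_pi[s, s'] = 1 if following the policy from s leads to s'
P_pi = np.zeros((n_states, n_states))
P_pi[np.arange(n_states), next_state[np.arange(n_states), policy]] = 1.0

# psi_pi(s) = psi(s, pi(s)) solves (I - gamma * P_pi) psi_pi = P_pi phi
psi_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, P_pi @ phi)
# psi(s, a) = phi(s') + gamma * psi_pi(s'), with s' = next_state[s, a]
psi = phi[next_state] + gamma * psi_pi[next_state]      # shape (3, 2, 2)

def rollout_q(s, a, reward, horizon=500):
    """Truncated discounted sum of rewards on visited next states under the policy."""
    total, s_next = 0.0, next_state[s, a]
    for t in range(horizon):
        total += gamma ** t * reward[s_next]
        s_next = next_state[s_next, policy[s_next]]
    return total

w = np.array([0.5, -1.0])      # hypothetical reward weights
reward = phi @ w               # r(s) = phi(s)^T w
q_rollout = np.array([[rollout_q(s, a, reward) for a in range(n_actions)]
                      for s in range(n_states)])
q_sf = psi @ w                 # action-values read off the successor features
print(np.allclose(q_rollout, q_sf))
```

The rollout computes the value directly from rewards, while `q_sf` uses only the successor features and `w`, so agreement between the two checks the identity rather than assuming it.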

Figures (12)

  • Figure 1: Overview of our method: Unsupervised zero-shot RL methods provide us with an initial policy $\pi_{z_r}$; we propose a way to leverage the latent space of learned policies as well as the pre-trained critic to rapidly adapt and improve $\pi_{z_r}$ using a few task-specific environment interactions. Right: Illustrative summary of our results.
  • Figure 2: Performance comparison of zero-shot policy vs adapted policy in the BFM's latent space after 200 episodes. The shaded region shows the improvement of the adapted policies averaged across tasks.
  • Figure 3: Qualitative difference in behaviors in 10 episodes of adaptation in the HumEnv environment for the task move-ego-low-180-2 with our method LoLA.
  • Figure 4: Top: Performance improvement w.r.t. the zero-shot policy for different online fast adaptation methods and BFMs. TD3(I) denotes standard action-based TD3 with zero-shot policy initialization; our methods are as described in Section \ref{sec:methods}. Bottom: Cosine similarity between the zero-shot policy $z_{r}$ and the learned policy $z$ for the methods working in the latent policy space. We report mean and standard deviation over 5 seeds. Results are averaged over 19 tasks for FB, PSM, HILP and 45 tasks for FB-CPR.
  • Figure 5: Average returns for several variations of LoLA, ReLA, and action-based TD3 with warm start. We use no-R to denote that we do not use the BFM's estimated value function (i.e., for LoLA we do not bootstrap the terminal state and for ReLA we learn a critic from scratch) and no-I to denote that we do not use zero-shot policy initialization. Finally, for TD3 we use R to denote that we use a residual critic, since the standard implementation learns a critic from scratch.
  • ...and 7 more figures
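
The cosine similarity reported in the bottom panel of Figure 4 can be computed as in the short sketch below. The function name and the example vectors are hypothetical and only illustrate the formula $\cos(z_r, z) = z_r^\top z / (\|z_r\| \, \|z\|)$.

```python
import numpy as np

def cosine_similarity(z_r, z):
    """Cosine of the angle between the zero-shot latent z_r and an adapted latent z."""
    return float(np.dot(z_r, z) / (np.linalg.norm(z_r) * np.linalg.norm(z)))

z_r = np.array([1.0, 0.0, 0.0])       # hypothetical zero-shot latent
z_same = np.array([2.0, 0.0, 0.0])    # same direction, different norm
z_orth = np.array([0.0, 3.0, 0.0])    # orthogonal direction
sim_same = cosine_similarity(z_r, z_same)
sim_orth = cosine_similarity(z_r, z_orth)
print(sim_same, sim_orth)
```

A value near 1 means the adapted latent points in almost the same direction as the zero-shot latent, while lower values mean the search has moved further away from it.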

Theorems & Definitions (4)

  • proposition 1
  • proof
  • proposition 2
  • proof