Finer Behavioral Foundation Models via Auto-Regressive Features and Advantage Weighting

Edoardo Cetin, Ahmed Touati, Yann Ollivier

TL;DR

This work addresses zero-shot reinforcement learning with Forward-Backward representations by tackling two bottlenecks: linear task encoding and offline-data optimization. It introduces auto-regressive task features to enable nonlinear, hierarchical task encodings and integrates advanced offline-RL techniques (advantage weighting, evaluation-based sampling, and uncertainty modeling) to improve learning from reward-free offline datasets. Empirically, AR-FB with AW and AWARE delivers strong performance on Jaco Arm, DMC Locomotion, MOOD datasets, and D4RL benchmarks, matching or approaching task-specific offline methods in several settings. The results indicate that zero-shot behavioral foundation models can reach a substantial fraction of specialized offline-RL performance, with AR-FB offering moderate gains in spatial precision and out-of-dataset generalization under adequate offline data conditions.

Abstract

The forward-backward representation (FB) is a recently proposed framework (Touati et al., 2023; Touati & Ollivier, 2021) to train behavior foundation models (BFMs) that aim at providing zero-shot efficient policies for any new task specified in a given reinforcement learning (RL) environment, without training for each new task. Here we address two core limitations of FB model training. First, FB, like all successor-feature-based methods, relies on a linear encoding of tasks: at test time, each new reward function is linearly projected onto a fixed set of pre-trained features. This limits expressivity as well as precision of the task representation. We break the linearity limitation by introducing auto-regressive features for FB, which let fine-grained task features depend on coarser-grained task information. This can represent arbitrary nonlinear task encodings, thus significantly increasing expressivity of the FB framework. Second, it is well known that training RL agents from offline datasets often requires specific techniques. We show that FB works well together with such offline RL techniques, by adapting techniques from Nair et al. (2020b) and Cetin et al. (2024) for FB. This is necessary to get non-flatlining performance in some datasets, such as DMC Humanoid. As a result, we produce efficient FB BFMs for a number of new environments. Notably, in the D4RL locomotion benchmark, the generic FB agent matches the performance of standard single-task offline agents (IQL, XQL). In many setups, the offline techniques are needed to get any decent performance at all. The auto-regressive features have a positive but moderate impact, concentrated on tasks requiring spatial precision and task generalization beyond the behaviors represented in the training set.
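To make the linearity limitation concrete, here is a minimal sketch of linear task encoding. It assumes the usual FB inference rule $z_r = \mathbb{E}_s[r(s)\,B(s)]$, estimated as a sample average; the function name `infer_task_vector` and the random arrays standing in for pre-trained features and rewards are hypothetical and not taken from the paper.

```python
import numpy as np

def infer_task_vector(features, rewards):
    """Monte Carlo estimate of z_r = E[r(s) B(s)].

    features: (n, d) array whose rows stand in for B(s_i) (hypothetical data).
    rewards:  (n,) array of rewards r(s_i) on the same states.
    """
    features = np.asarray(features, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return features.T @ rewards / len(rewards)

rng = np.random.default_rng(0)
B = rng.normal(size=(1000, 8))   # stand-in for fixed pre-trained features
r1 = rng.normal(size=1000)       # two arbitrary reward functions
r2 = rng.normal(size=1000)

z1 = infer_task_vector(B, r1)
z2 = infer_task_vector(B, r2)
z_sum = infer_task_vector(B, r1 + r2)   # equals z1 + z2: the encoding is linear

# The part of a reward lying outside the span of the features is invisible:
coef = np.linalg.lstsq(B, r1, rcond=None)[0]
residual = r1 - B @ coef                 # component of r1 orthogonal to the features
z_residual = infer_task_vector(B, residual)   # numerically zero
```

Under these assumptions, any two rewards that differ only by such a residual receive the same task vector, which is one way to read the loss of expressivity and precision described above.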

Paper Structure

This paper contains 42 sections, 4 theorems, 29 equations, 13 figures, 7 tables, 2 algorithms.

Key Result

Theorem 3.2

Assume we have learned representations $F\colon S\times A\times {\mathbb{R}}^d \to {\mathbb{R}}^d$ and $B\colon S\times {\mathbb{R}}^d\to {\mathbb{R}}^d$, as well as a parametric family of policies $\pi_z$ depending on $z\in {\mathbb{R}}^d$, satisfying Then the following holds. For any reward function $r$, if we can find a value $z_r\in {\mathbb{R}}^d$ such that then $\pi_{z_r}$ is an optimal po

Figures (13)

  • Figure 1: An auto-regressive architecture for $B(s,z)$. The $i$-th block of the output $B$ only depends on blocks $z_1,\ldots,z_{i-1}$ of the input $z$. In each layer, the weights from each block to the lower-ranking blocks of the next layer have been removed. The state $s$ is still fed to every block on the input layer.
  • Figure 2: Average cumulative reward achieved by the algorithms, trained on the RND dataset for different representation dimensions, when aiming to reach goals (four randomly selected goals and four corner goals) in the Jaco arm environment.
  • Figure 3: Average cumulative reward achieved by the algorithms on in-dataset tasks, trained on the MOOD dataset for DMC Locomotion.
  • Figure 4: Average cumulative reward achieved by the algorithms on out-of-dataset tasks, trained on the MOOD dataset for DMC Locomotion.
  • Figure 5: Example of behaviors inferred from reward equations.
  • ...and 8 more figures
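
The block structure described in the Figure 1 caption can be sketched with masked weight matrices. This is an illustration under stated assumptions, not the authors' implementation: the class `ARBackward`, the helpers `block_mask` and `infer_z`, the two-layer shape, and the tanh nonlinearity are all hypothetical choices. The block-by-block inference loop assumes each block is estimated as $z_i = \mathbb{E}_s[r(s)\,B_i(s, z_1,\ldots,z_{i-1})]$, and any normalization of $z$ is omitted.

```python
import numpy as np

def block_mask(n_blocks, in_size, out_size, strict):
    """0/1 mask of shape (n_blocks*out_size, n_blocks*in_size).

    Input block j may feed output block i when j < i (strict) or j <= i.
    """
    i = np.arange(n_blocks)[:, None]
    j = np.arange(n_blocks)[None, :]
    allowed = (j < i) if strict else (j <= i)
    return np.kron(allowed.astype(float), np.ones((out_size, in_size)))

class ARBackward:
    """Toy two-layer B(s, z) whose i-th output block ignores z_i, z_{i+1}, ..."""

    def __init__(self, state_dim, n_blocks, block_dim, hidden_per_block, seed=0):
        rng = np.random.default_rng(seed)
        d, h = n_blocks * block_dim, n_blocks * hidden_per_block
        self.n_blocks, self.block_dim = n_blocks, block_dim
        # The state is fed to every hidden block (no mask).
        self.W_s = rng.normal(scale=0.5, size=(h, state_dim))
        # z block j reaches hidden block i only if j < i.
        self.W_z = rng.normal(scale=0.5, size=(h, d)) * block_mask(
            n_blocks, block_dim, hidden_per_block, strict=True)
        # Hidden block j reaches output block i only if j <= i.
        self.W_out = rng.normal(scale=0.5, size=(d, h)) * block_mask(
            n_blocks, hidden_per_block, block_dim, strict=False)

    def __call__(self, s, z):
        hidden = np.tanh(self.W_s @ s + self.W_z @ z)
        return self.W_out @ hidden

def infer_z(model, states, rewards):
    """Fill z block by block; block i only needs the blocks already filled."""
    z = np.zeros(model.n_blocks * model.block_dim)
    for i in range(model.n_blocks):
        sl = slice(i * model.block_dim, (i + 1) * model.block_dim)
        feats = np.stack([model(s, z)[sl] for s in states])
        z[sl] = feats.T @ rewards / len(rewards)
    return z

model = ARBackward(state_dim=3, n_blocks=3, block_dim=2, hidden_per_block=4)
rng = np.random.default_rng(1)
states = rng.normal(size=(50, 3))   # hypothetical sampled states
rewards = rng.normal(size=50)       # hypothetical rewards on those states
z = infer_z(model, states, rewards)

# Changing the last block of z leaves every output block unchanged.
z_last = z.copy()
z_last[-2:] += 1.0
# Changing the first block leaves output block 0 unchanged but moves block 1.
z_first = z.copy()
z_first[:2] += 1.0
out, out_last, out_first = (model(states[0], v) for v in (z, z_last, z_first))
```

Because output block $i$ never reads $z_i$ or later blocks, the vector returned by `infer_z` is consistent with the full model: recomputing all features with the completed `z` and averaging them against the rewards reproduces the same `z`.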

Theorems & Definitions (9)

  • Definition 3.1
  • Theorem 3.2
  • Definition A.1: Extended forward-backward representation of an MDP
  • Theorem A.2: Forward-backward representation of an MDP, with features as goals
  • proof : Proof of Theorem \ref{thm:arfb2}
  • Theorem A.3: Auto-regressive features with two levels are a universal approximator for task encoding
  • Lemma A.4
  • proof : Proof of the lemma
  • proof