Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning

Daniel Lawson, Adriana Hugessen, Charlotte Cloutier, Glen Berseth, Khimya Khetarpal

TL;DR

The paper addresses the challenge of combinatorial generalization in goal-conditioned behavioral cloning by linking temporal coherence in representations to the successor representation (SR). It introduces BYOL-$\gamma$, a self-predictive objective that samples future states at an offset $k \sim \mathrm{Geom}(1-\gamma)$ to capture long-range dynamics and approximates $\tilde{M}^{\pi}$ via a low-rank decomposition $\tilde{M}^{\pi} \approx \Phi \Psi \Phi^\top$. The authors show theoretically that, in finite MDPs with linear features, BYOL-$\gamma$ approximates the SR, and they empirically demonstrate improved zero-shot combinatorial generalization on OGBench, often outperforming baseline BC and TD-based methods, with ablations clarifying the impact of key components. The work highlights the practical value of meaningful representations as auxiliary objectives for scaling offline, goal-conditioned policies to longer horizons and more complex environments.
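The geometric sampling of future states mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `sample_future_index`, the convention that offsets start at $k = 1$, and the clipping of the index to the end of the trajectory are all assumptions made here for the example.

```python
import numpy as np

def sample_future_index(t, horizon, gamma, rng):
    """Return a future index t + k with k ~ Geom(1 - gamma).

    Assumptions (not from the paper): k has support {1, 2, ...}, and the
    index is clipped to the last step of a trajectory of length `horizon`.
    """
    k = rng.geometric(p=1.0 - gamma)  # mean offset is 1 / (1 - gamma)
    return min(t + k, horizon - 1)

rng = np.random.default_rng(0)
gamma = 0.9

# Empirically, the average offset is close to 1 / (1 - gamma) = 10 steps.
offsets = rng.geometric(p=1.0 - gamma, size=100_000)
mean_offset = offsets.mean()

# Pick a future state index for the state at t = 5 in a 50-step trajectory.
future = sample_future_index(t=5, horizon=50, gamma=gamma, rng=rng)
```

A larger $\gamma$ pushes the sampled targets further into the future, which is how the objective can capture longer-range dynamics than a one-step prediction target.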

Abstract

While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e., combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally correlated states are properly encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. We formalize this notion by demonstrating how encouraging long-range temporal consistency via successor representations (SR) can facilitate generalization. We then propose a simple yet effective representation learning objective, $\text{BYOL-}\gamma$, for GCBC, which theoretically approximates the successor representation in the finite MDP case through self-predictive representations, and achieves competitive empirical performance across a suite of challenging tasks requiring combinatorial generalization.
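To make the link between temporal consistency and the successor representation concrete, the sketch below computes the SR in closed form for a small finite MDP. The toy dynamics (a random walk on a line of six states) and the $(1-\gamma)$ normalization are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

n, gamma = 6, 0.9

# Hypothetical toy dynamics: a random walk on a line of n states,
# where a step off either end keeps the agent in place.
P = np.zeros((n, n))
for s in range(n):
    for nxt in (max(s - 1, 0), min(s + 1, n - 1)):
        P[s, nxt] += 0.5

# Normalized SR: M = (1 - gamma) * sum_k gamma^k P^k
#                  = (1 - gamma) * (I - gamma P)^{-1}
M = (1.0 - gamma) * np.linalg.inv(np.eye(n) - gamma * P)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

near = cosine(M[0], M[1])      # SR rows of two adjacent states
far = cosine(M[0], M[n - 1])   # SR rows of the two opposite ends
```

In this toy example the SR rows of adjacent states are more similar than those of distant states, which is the kind of temporal consistency the abstract argues should shrink the out-of-distribution gap for novel state-goal pairs.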

Paper Structure

This paper contains 41 sections, 1 theorem, 34 equations, 6 figures, 8 tables.

Key Result

Theorem 4.1

Given a finite MDP with linear representations $\Phi \in \mathbb{R}^{|\mathcal{S}| \times d}$, and predictor $\Psi \in \mathbb{R}^{d\times d}$, under assumptions of orthogonal initialization for $\Phi$ (Ass. ass:orthinit), a uniform initial state distribution $p_0(s)$ (Ass. ass:uniforminit), and sym

Figures (6)

  • Figure 1: (\ref{fig:predictive}) Self-predictive Representations. We consider training on trajectories such as $s_0 \rightarrow s_h$ and $s_b \rightarrow s_f$, which intersect at $w$, and then evaluate on a task such as $s_0 \rightarrow s_f$, which requires combinatorial generalization. (\ref{fig:byolg-diagram}) Representation learning with $\textbf{BYOL-}\boldsymbol{\gamma}$. We predict future state representations $\phi(s_{t+k})$ via $\psi_f(\phi(s_t), a)$, and also predict backwards with $\psi_b(\phi(s_{t+k}))$. The target offset is sampled geometrically: $k \sim \mathrm{Geom}(1 - \gamma)$. Stop-gradients are denoted by //. We provide more details on the training procedure $\mathcal{L}$ in Section \ref{sec:byol-policy}.
  • Figure 2: Visualization of the Learned Representation: depicts the similarity between the prediction made from the current state representation and the goal representation. For $\textbf{BYOL-}\boldsymbol{\gamma}$ and TD-SR, we visualize the cosine similarity between $\psi(\phi(s), \cdot)$ or $\psi(s, \cdot)$ and $\phi(g)$ for all $s \in D$, for a fixed goal $g$ indicated by the red star.
  • Figure 3: Evaluating Generalization with Increasing Horizons: shows that $\textbf{BYOL-}\boldsymbol{\gamma}$ not only performs well on goals in the near horizon but also helps generalize to goals requiring stitching, i.e., those after the red bar ($> 4$).
  • Figure 4: Encoder Variation. When training with BYOL, $\text{BYOL-}\gamma$, and TD-SR, we utilize policies with architecture (a), which uses $\phi$ to process states and goals. We utilize architecture (b) for TRA to match the prior implementation; however, in Appendix \ref{appendix:action-cond} we train TRA with architecture (a) and action-conditioning.
  • Figure 5: Evaluating Generalization with Increasing Horizons: The distances to the right of the red dotted line require combinatorial generalization. The maze maps show examples of how intermediate goals are selected along the optimal path.
  • ...and 1 more figure
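
The forward and backward predictions described in the Figure 1 caption can be sketched as a loss computation. This is only a rough illustration under stated assumptions: the linear maps `W_phi`, `W_f`, and `W_b`, the concatenation of the action with the latent, the choice of $\phi(s_t)$ as the backward target, and the squared error between L2-normalized vectors are all hypothetical stand-ins, since the exact form of $\mathcal{L}$ is not given in this summary. NumPy has no autodiff, so the stop-gradients appear only as comments.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, d = 8, 2, 4

# Hypothetical linear stand-ins for the encoder and the two predictors.
W_phi = rng.normal(size=(obs_dim, d))    # phi(s) = s @ W_phi
W_f = rng.normal(size=(d + act_dim, d))  # forward predictor psi_f(phi(s_t), a)
W_b = rng.normal(size=(d, d))            # backward predictor psi_b(phi(s_{t+k}))

def normalize(z):
    """L2-normalize along the last axis."""
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def byol_gamma_loss(s_t, a_t, s_future):
    z_t = s_t @ W_phi
    z_future = s_future @ W_phi
    pred_fwd = np.concatenate([z_t, a_t], axis=-1) @ W_f
    pred_bwd = z_future @ W_b
    # In an autodiff framework, the targets z_future and z_t below
    # would be wrapped in a stop-gradient.
    fwd = np.sum((normalize(pred_fwd) - normalize(z_future)) ** 2, axis=-1)
    bwd = np.sum((normalize(pred_bwd) - normalize(z_t)) ** 2, axis=-1)
    return float(np.mean(fwd + bwd))

batch = 16
s_t = rng.normal(size=(batch, obs_dim))
a_t = rng.normal(size=(batch, act_dim))
# In practice this would be s_{t+k} with k ~ Geom(1 - gamma).
s_future = rng.normal(size=(batch, obs_dim))
loss = byol_gamma_loss(s_t, a_t, s_future)
```

With normalized vectors each term lies in $[0, 4]$, so the combined loss stays bounded regardless of the scale of the latents.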

Theorems & Definitions (1)

  • Theorem 4.1