A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning

Khimya Khetarpal; Zhaohan Daniel Guo; Bernardo Avila Pires; Yunhao Tang; Clare Lyle; Mark Rowland; Nicolas Heess; Diana Borsa; Arthur Guez; Will Dabney

TL;DR

BYOL-AC learns representations that capture spectral information of per-action transition dynamics $T_a$, whereas BYOL-$\Pi$ targets spectral information in $T^{\pi}$. The authors introduce a variance-like BYOL-VAR and derive a variance relation linking the three objectives within an ODE framework, along with two unifying lenses: a model-based view of low-rank dynamics and a model-free view of 1-step value/Q/advantage functions. Empirically, BYOL-AC often yields superior representations in linear settings and in Minigrid/control domains, while BYOL-VAR provides insights into action-distinguishing features. This framework guides the design of self-predictive objectives for robust RL representations and clarifies how different objectives emphasize distinct aspects of environment dynamics.

Abstract

Learning a good representation is a crucial challenge for Reinforcement Learning (RL) agents. Self-predictive learning provides a means to jointly learn a latent representation and a dynamics model by bootstrapping from future latent representations (BYOL). Recent work has developed theoretical insights into these algorithms by studying a continuous-time ODE model for self-predictive representation learning under the simplifying assumption that the algorithm depends on a fixed policy (BYOL-$\Pi$); this assumption is at odds with practical instantiations of such algorithms, which explicitly condition their predictions on future actions. In this work, we take a step towards bridging the gap between theory and practice by analyzing an action-conditional self-predictive objective (BYOL-AC) using the ODE framework, characterizing its convergence properties and highlighting important distinctions between the limiting solutions of the BYOL-$\Pi$ and BYOL-AC dynamics. We show how the two representations are related by a variance equation. This connection leads to a novel variance-like action-conditional objective (BYOL-VAR) and its corresponding ODE. We unify the study of all three objectives through two complementary lenses: a model-based perspective, where each objective is shown to be equivalent to a low-rank approximation of certain dynamics, and a model-free perspective, which establishes relationships between the objectives and their respective value, Q-value, and advantage functions. Our empirical investigations, encompassing both linear function approximation and Deep RL environments, demonstrate that BYOL-AC is better overall in a variety of different settings.
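
The variance equation mentioned in the abstract can be illustrated with a small numerical sketch. The setup below is an assumption and is not spelled out in this summary: the per-action transition matrices share eigenvectors, actions are weighted uniformly, and each objective scores an eigenvector by a statistic of its per-action eigenvalues (squared mean for BYOL-$\Pi$, mean of squares for BYOL-AC, variance for BYOL-VAR). The eigenvalues and the helper `top_k` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical eigenvalues: rows are actions, columns are shared eigenvectors.
# These numbers are made up for illustration and do not come from the paper.
eigvals = np.array([
    [1.0, 0.9,  0.5, -0.8, 0.1, 0.0],
    [1.0, 0.8, -0.5,  0.8, 0.3, 0.0],
])

# Assumed per-eigenvector scores under a uniform weighting of actions.
score_pi = eigvals.mean(axis=0) ** 2      # (E_a[lambda_a])^2
score_ac = (eigvals ** 2).mean(axis=0)    # E_a[lambda_a^2]
score_var = eigvals.var(axis=0)           # Var_a(lambda_a)

# Variance identity: E[X^2] = (E[X])^2 + Var(X), applied per eigenvector.
assert np.allclose(score_ac, score_pi + score_var)

def top_k(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    return np.sort(np.argsort(-scores, kind="stable")[:k])

k = 3
picked_pi = top_k(score_pi, k)
picked_ac = top_k(score_ac, k)
picked_var = top_k(score_var, k)
print("BYOL-Pi :", picked_pi)
print("BYOL-AC :", picked_ac)
print("BYOL-VAR:", picked_var)
```

In this toy example the three scores select different sets of eigenvectors: eigenvectors whose eigenvalues cancel on average across actions score zero under the squared mean but remain visible to the mean of squares and to the variance.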

Paper Structure (42 sections, 24 theorems, 74 equations, 6 figures, 3 tables)

Introduction
Preliminaries
BYOL ODE with Fixed Policy: BYOL-$\Pi$
Understanding Action-Conditional BYOL
The Action-Conditional BYOL Objective
Comparing the Representation Learned by BYOL-$\Pi$ and BYOL-AC
Variance-Like Action-Conditional BYOL
Two Unifying Perspectives: Model-Based and Model-Free
Fitting Dynamics - A Model-Based View
Fitting Value Functions - A Model-Free View
Experiments
Linear Function Approximation
Deep Reinforcement Learning
Discussion
Related Work
...and 27 more sections

Key Result

Lemma 1

Under the orthogonal-initialization assumption, we have that $\Phi^T \dot{\Phi} = 0$, which means that $\Phi^T \Phi = I$ is preserved throughout the ODE process.
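
The non-collapse property can be checked numerically. The following is a minimal sketch under assumptions that this summary does not state: a uniform state distribution, a linear predictor $P$ fitted by least squares, and a semi-gradient flow $\dot{\Phi} = (T\Phi - \Phi P)P^T$ with a stop-gradient on the target. The function `byol_flow` and the random transition matrix are hypothetical.

```python
import numpy as np

def byol_flow(Phi, T):
    """Return (Phi_dot, P) for the assumed self-predictive semi-gradient flow."""
    # Least-squares optimal linear predictor of T @ Phi from Phi.
    P = np.linalg.solve(Phi.T @ Phi, Phi.T @ T @ Phi)
    # Descent direction of ||Phi P - sg(T Phi)||^2 with respect to Phi
    # (constant factors dropped; the target T @ Phi is treated as fixed).
    Phi_dot = (T @ Phi - Phi @ P) @ P.T
    return Phi_dot, P

rng = np.random.default_rng(0)
n_states, k = 6, 2

# Hypothetical row-stochastic transition matrix.
A = rng.random((n_states, n_states))
T = A / A.sum(axis=1, keepdims=True)

# Orthonormal initialization: Phi^T Phi = I.
Phi, _ = np.linalg.qr(rng.standard_normal((n_states, k)))

Phi_dot, P = byol_flow(Phi, T)

# Largest entry of Phi^T Phi_dot; zero up to floating-point error.
residual = np.abs(Phi.T @ Phi_dot).max()
print("max |Phi^T Phi_dot| =", residual)
```

With the optimal predictor plugged in, $\Phi^T(T\Phi - \Phi P) = \Phi^T T \Phi - \Phi^T \Phi P = 0$, so $\frac{d}{dt}(\Phi^T \Phi) = \dot{\Phi}^T \Phi + \Phi^T \dot{\Phi} = 0$ and the orthonormality of the initialization is carried along the flow.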

Figures (6)

Figure 1: On the representations across BYOL-$\Pi$, BYOL-AC, and BYOL-VAR. We consider a simple MDP with two actions and corresponding transition functions $T_{a_0}, T_{a_1}$, with the eigenvalues of each action depicted in the two leftmost plots. The middle plot shows a stacked bar plot of the trace objective values corresponding to each objective. The three rightmost plots show each objective picking its top-$k$ ($k=4$) eigenvectors.
Figure 2: Comparing BYOL-$\Pi$, BYOL-AC, and BYOL-VAR augmented with a V-MPO agent in Minigrid. $\Phi_\text{ac}$ is overall better than $\Phi$, whereas $\Phi_\text{var}$ is a weak baseline and struggles.
Figure 3: BYOL-AC (orange) is overall better when compared to BYOL-$\Pi$ (blue).
Figure 4: High-level architecture of our RL agent. Network details are given in the appendix sections on Minigrid and OpenAI Gym.
Figure 5: Comparing BYOL-$\Pi$, BYOL-AC, and BYOL-VAR on different domains in Minigrid across varying prediction horizons $H= 1, 4, 16$.
...and 1 more figures

Theorems & Definitions (38)

Lemma 1: Non-collapse (Tang et al., 2022)
Lemma 2: BYOL Trace Objective (Tang et al., 2022)
Theorem 1: BYOL-$\Pi$ ODE (Tang et al., 2022)
Lemma 3: Non-collapse BYOL-AC
Lemma 4: BYOL-AC Trace Objective
Theorem 2: BYOL-AC ODE
Remark 1: Variance Relation
Lemma 5: Non-collapse BYOL-VAR
Lemma 6: BYOL-VAR Trace Objective
Theorem 3: BYOL-VAR ODE
...and 28 more
