A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning
Khimya Khetarpal, Zhaohan Daniel Guo, Bernardo Avila Pires, Yunhao Tang, Clare Lyle, Mark Rowland, Nicolas Heess, Diana Borsa, Arthur Guez, Will Dabney
TL;DR
BYOL-AC learns representations that capture spectral information of the per-action transition dynamics $T_a$, whereas BYOL-$\Pi$ targets spectral information in the policy-averaged dynamics $T^{\pi}$. The authors introduce a variance-like objective, BYOL-VAR, and derive a variance relation linking the three objectives within an ODE framework, along with two unifying lenses: a model-based view in terms of low-rank approximations of the dynamics and a model-free view in terms of 1-step value, Q-value, and advantage functions. Empirically, BYOL-AC often yields superior representations in linear settings and in Minigrid/control domains, while BYOL-VAR provides insights into action-distinguishing features. This framework guides the design of self-predictive objectives for robust RL representations and clarifies how different objectives emphasize distinct aspects of environment dynamics.
Abstract
Learning a good representation is a crucial challenge for Reinforcement Learning (RL) agents. Self-predictive learning provides a means to jointly learn a latent representation and dynamics model by bootstrapping from future latent representations (BYOL). Recent work has developed theoretical insights into these algorithms by studying a continuous-time ODE model for self-predictive representation learning under the simplifying assumption that the algorithm depends on a fixed policy (BYOL-$\Pi$); this assumption is at odds with practical instantiations of such algorithms, which explicitly condition their predictions on future actions. In this work, we take a step towards bridging the gap between theory and practice by analyzing an action-conditional self-predictive objective (BYOL-AC) using the ODE framework, characterizing its convergence properties and highlighting important distinctions between the limiting solutions of the BYOL-$\Pi$ and BYOL-AC dynamics. We show how the two representations are related by a variance equation. This connection leads to a novel variance-like action-conditional objective (BYOL-VAR) and its corresponding ODE. We unify the study of all three objectives through two complementary lenses: a model-based perspective, where each objective is shown to be equivalent to a low-rank approximation of certain dynamics, and a model-free perspective, which establishes relationships between the objectives and their respective value, Q-value, and advantage functions. Our empirical investigations, encompassing both linear function approximation and deep RL environments, demonstrate that BYOL-AC performs better overall across a variety of settings.
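To make the phrase "related by a variance equation" concrete, the following is a minimal numerical sketch of the generic second-moment identity (mean of squares equals square of the mean plus variance) applied to projected transition matrices. It is an illustration only, not the objectives as defined in the paper: the toy MDP, the uniform policy, the orthonormal representation `Phi`, and the function names `random_transition_matrices` and `variance_decomposition` are all assumptions introduced here.

```python
import numpy as np


def random_transition_matrices(num_actions, num_states, rng):
    """Sample one row-stochastic matrix T_a per action (hypothetical toy MDP)."""
    raw = rng.random((num_actions, num_states, num_states))
    return raw / raw.sum(axis=2, keepdims=True)


def variance_decomposition(T, Phi):
    """Return (ac_term, pi_term, var_term), assuming a uniform policy.

    ac_term  : mean over actions of ||Phi^T T_a Phi||_F^2
    pi_term  : ||Phi^T T_pi Phi||_F^2, where T_pi is the mean of the T_a
    var_term : mean over actions of ||Phi^T (T_a - T_pi) Phi||_F^2
    """
    T_pi = T.mean(axis=0)  # policy-averaged dynamics under a uniform policy

    def project(M):
        # Express the dynamics M in the coordinates of the representation Phi.
        return Phi.T @ M @ Phi

    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    ac_term = np.mean([np.linalg.norm(project(T_a)) ** 2 for T_a in T])
    pi_term = np.linalg.norm(project(T_pi)) ** 2
    var_term = np.mean([np.linalg.norm(project(T_a - T_pi)) ** 2 for T_a in T])
    return ac_term, pi_term, var_term


rng = np.random.default_rng(0)
T = random_transition_matrices(num_actions=3, num_states=6, rng=rng)
# Orthonormal columns via QR; an illustrative choice of representation.
Phi, _ = np.linalg.qr(rng.standard_normal((6, 2)))

ac, pi, var = variance_decomposition(T, Phi)
print("action-conditional term:", ac)
print("policy-averaged term:   ", pi)
print("variance-like term:     ", var)
print("ac == pi + var:", np.isclose(ac, pi + var))
```

Because the projection is linear in the transition matrix, the action-conditional term splits into the policy-averaged term plus a non-negative variance-like term, up to floating-point error. The variance-like term is zero exactly when all projected per-action dynamics coincide, which is one way to read the claim that a variance-like objective emphasizes features that distinguish actions.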
