Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning

Bastien Dubail, Stefan Stojanovic, Alexandre Proutière

TL;DR

This work demonstrates that a low-rank structure naturally emerges in the shifted successor measure, which captures the system dynamics after bypassing a few initial transitions, and establishes a connection between the necessary shift and the local mixing properties of the underlying dynamical system, which provides a natural way of selecting the shift.

Abstract

Low-rank structure is a common implicit assumption in many modern reinforcement learning (RL) algorithms. For instance, reward-free and goal-conditioned RL methods often presume that the successor measure admits a low-rank representation. In this work, we challenge this assumption by first remarking that the successor measure itself is not approximately low-rank. Instead, we demonstrate that a low-rank structure naturally emerges in the shifted successor measure, which captures the system dynamics after bypassing a few initial transitions. We provide finite-sample performance guarantees for the entry-wise estimation of a low-rank approximation of the shifted successor measure from sampled entries. Our analysis reveals that both the approximation and estimation errors are primarily governed by a newly introduced quantity: the spectral recoverability of the corresponding matrix. To bound this parameter, we derive a new class of functional inequalities for Markov chains, which we call Type II Poincaré inequalities, and from which we can quantify the amount of shift needed for effective low-rank approximation and estimation. This analysis shows in particular that the required shift depends on the decay of the high-order singular values of the shifted successor measure and is hence typically small in practice. Additionally, we establish a connection between the necessary shift and the local mixing properties of the underlying dynamical system, which provides a natural way of selecting the shift. Finally, we validate our theoretical findings with experiments and demonstrate that shifting the successor measure indeed leads to improved performance in goal-conditioned RL.
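As an illustrative sketch (not taken from the paper), the effect of the shift on singular-value decay can be checked numerically on a toy chain. The snippet assumes the $k$-shifted successor measure has the form $(1-\gamma)\sum_{t\ge 0}\gamma^t P^{t+k}$ for the transition matrix $P$ of a fixed policy. Definition 1 is not reproduced on this page, so this form, the lazy ring walk, and all helper names are assumptions made for illustration only.

```python
import numpy as np

def ring_walk(n, stay=0.5):
    """Transition matrix of a lazy random walk on a ring of n states (toy example)."""
    P = np.zeros((n, n))
    for s in range(n):
        P[s, s] = stay
        P[s, (s - 1) % n] += (1 - stay) / 2
        P[s, (s + 1) % n] += (1 - stay) / 2
    return P

def shifted_successor_measure(P, gamma, k):
    """Assumed form of the k-shift: (1 - gamma) * sum_{t>=0} gamma^t P^(t+k)."""
    n = P.shape[0]
    # Closed form of sum_{t>=0} gamma^t P^t, valid for 0 <= gamma < 1.
    resolvent = np.linalg.inv(np.eye(n) - gamma * P)
    return (1 - gamma) * np.linalg.matrix_power(P, k) @ resolvent

def rank_r_error(M, r):
    """Spectral-norm error of the best rank-r approximation, i.e. sigma_{r+1}(M)."""
    singular_values = np.linalg.svd(M, compute_uv=False)  # sorted descending
    return singular_values[r]

if __name__ == "__main__":
    P = ring_walk(20)
    for k in range(4):
        M_k = shifted_successor_measure(P, gamma=0.95, k=k)
        print(f"shift k={k}: sigma_6 = {rank_r_error(M_k, 5):.4f}")
```

On this toy chain, the $(r+1)$-th singular value shrinks as $k$ grows, which is the qualitative behaviour the abstract describes. The entry-wise guarantees of the paper involve the spectral recoverability as well, which this sketch does not compute.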

Paper Structure

This paper contains 75 sections, 37 theorems, 196 equations, 11 figures.

Key Result

Lemma 1

Let $M\in \mathbb{R}^{n\times n}$. Then, for any $1\le r<n$, $\|M - [M]_r\|_{2,\infty}\le \sqrt{\sigma_{r+1}\xi(M)}$.

Figures (11)

  • Figure 1: The discrete Medium PointMaze environment (see Section \ref{sec:experiments}). Performance of goal-conditioned RL based on the rank-$r$ approximation of the $k$-shifted successor measure. Peak performance occurs at a non-zero shift, suggesting that shifting the successor measure can improve policy learning under low-rank constraints.
  • Figure 2: Approximation error as a function of the shift parameter $k$ and rank $r$. The theoretical upper bound serves as a first-order proxy for the entry-wise error. We use the standard $\Vert \cdot \Vert_{2 \to \infty}$ norm, which matches (up to a $\sqrt{n}$ factor) the variant from Section \ref{subsec:norms_measure} under the uniform measure $\nu$. See Section \ref{sec:experiments} for experimental details.
  • Figure 3: The four-room environment.
  • Figure 4: (a) Discrete Medium PointMaze environment. Each state $s$ is colored by $\max_a \sum_{b\in \mathcal{A}} M_{\pi_{\mathcal{D}},k=1}(s,a,g,b)$, with $\gamma=0.95$, the goal $g$ marked by a star, and actions following a uniform policy $\pi_{\mathcal{D}}$. Arrows indicate the greedy policy $\pi(s\vert g) = \mathop{\mathrm{argmax}}\nolimits_a \sum_b M_{\pi_{\mathcal{D}},k=1}(s,a,g,b)$. (b) Singular values of shifted successor measures. (c–d) Accuracy (probability of reaching a random goal) and relaxed accuracy (probability of reaching its $2$-neighborhood) as a function of rank and shift for true successor measures. (e–f) Same as (c–d), but for successor measures learned via TD. (g–h) Accuracy vs. number of trajectories of length $H=100$. Results are averaged over $100$ random goals and $5$ seeds.
  • Figure 5: Main steps of the proof of Theorem \ref{thm:main_upper_bound}.
  • ...and 6 more figures

Theorems & Definitions (75)

  • Definition 1: $k$-shifted successor measure
  • Definition 2: $\nu$-SVD
  • Definition 3: Spectral (ir)recoverability
  • Lemma 1
  • Theorem 1
  • Corollary 1
  • Proposition 1
  • Theorem 2
  • Theorem 3
  • Definition 4: Induced Markov chain
  • ...and 65 more