Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation

Stefan Stojanovic, Yassir Jedra, Alexandre Proutiere

TL;DR

This work presents LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps, which achieves order-optimal sample complexity under milder conditions than those assumed in previously proposed approaches.

Abstract

We consider the problem of learning an $\varepsilon$-optimal policy in controlled dynamical systems with low-rank latent structure. For this problem, we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps. In the latter, the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the leverage scores of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, we establish entry-wise guarantees that, remarkably, do not depend on the coherence of the matrix but only on its spikiness. These guarantees imply that LoRa-PI learns an $\varepsilon$-optimal policy using $\widetilde{O}\left(\frac{S+A}{\mathrm{poly}(1-\gamma)\,\varepsilon^2}\right)$ samples, where $S$ (resp. $A$) denotes the number of states (resp. actions) and $\gamma$ the discount factor. Our algorithm achieves this order-optimal (in $S$, $A$ and $\varepsilon$) sample complexity under milder conditions than those assumed in previously proposed approaches.
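The two-phase policy evaluation step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, its parameters, the choice to sample anchors with probability proportional to the estimated leverage scores, and the rank-truncated pseudo-inverse are all assumptions made for illustration. `sample_entry(i, j)` is a hypothetical oracle returning one noisy observation of entry $(i, j)$ of the target matrix.

```python
import numpy as np

def leveraged_cur_estimate(sample_entry, n_rows, n_cols, rank,
                           n_anchors, n_uniform, n_repeats, rng):
    """Illustrative two-phase low-rank estimate (all names are hypothetical)."""
    # Phase 1: sample entries uniformly at random and build a rescaled,
    # zero-filled matrix whose expectation is the target matrix.
    rows = rng.integers(0, n_rows, size=n_uniform)
    cols = rng.integers(0, n_cols, size=n_uniform)
    sums = np.zeros((n_rows, n_cols))
    for i, j in zip(rows, cols):
        sums[i, j] += sample_entry(i, j)
    rough = sums * (n_rows * n_cols / n_uniform)

    # Spectral step: leverage scores are the squared row norms of the
    # top-`rank` left and right singular vectors.
    U, _, Vt = np.linalg.svd(rough, full_matrices=False)
    row_scores = np.sum(U[:, :rank] ** 2, axis=1)
    col_scores = np.sum(Vt[:rank, :] ** 2, axis=0)

    # Phase 2: draw anchor rows and columns (assumption: probability
    # proportional to the estimated scores, without replacement).
    I = rng.choice(n_rows, size=n_anchors, replace=False,
                   p=row_scores / row_scores.sum())
    J = rng.choice(n_cols, size=n_anchors, replace=False,
                   p=col_scores / col_scores.sum())

    def averaged(i_idx, j_idx):
        # Average n_repeats fresh samples for every entry of the block.
        block = np.zeros((len(i_idx), len(j_idx)))
        for a, i in enumerate(i_idx):
            for b, j in enumerate(j_idx):
                block[a, b] = np.mean(
                    [sample_entry(i, j) for _ in range(n_repeats)])
        return block

    C = averaged(np.arange(n_rows), J)   # all rows, anchor columns
    R = averaged(I, np.arange(n_cols))   # anchor rows, all columns
    W = C[I, :]                          # anchor rows x anchor columns

    # CUR-like completion: C W^+ R with a rank-truncated pseudo-inverse.
    Uw, sw, Vwt = np.linalg.svd(W)
    W_pinv = (Vwt[:rank].T / sw[:rank]) @ Uw[:, :rank].T
    return C @ W_pinv @ R
```

In this sketch the second phase uses roughly $(S+A)K$ entries times the number of repeats for an $S \times A$ matrix with $K$ anchors, which is where the $S+A$ scaling (rather than $SA$) comes from.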

Paper Structure

This paper contains 44 sections, 19 theorems, 130 equations, 5 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

For any $\epsilon > 0$ and any $\tau \ge \frac{1}{1- \gamma} \log\left( \frac{r_{\max}}{(1- \gamma) \epsilon}\right)$, we have $\Vert Q^\pi - Q^\pi_\tau \Vert_{\infty} \le \epsilon$.
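A short derivation shows where the threshold on $\tau$ comes from. It is a sketch under assumptions not stated in this summary: $Q^\pi_\tau$ is taken to be the discounted return truncated after the first $\tau$ rewards, and rewards are bounded in absolute value by $r_{\max}$. Using $\gamma \le e^{-(1-\gamma)}$,

$$
\Vert Q^\pi - Q^\pi_\tau \Vert_{\infty} \;\le\; \sum_{t \ge \tau} \gamma^t r_{\max} \;=\; \frac{\gamma^\tau r_{\max}}{1-\gamma} \;\le\; \frac{e^{-(1-\gamma)\tau}\, r_{\max}}{1-\gamma} \;\le\; \epsilon ,
$$

where the last inequality holds as soon as $\tau \ge \frac{1}{1-\gamma} \log\left( \frac{r_{\max}}{(1-\gamma)\epsilon} \right)$. If the truncation keeps more than $\tau$ rewards, the tail is only smaller and the bound still holds.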

Figures (5)

  • Figure 1: Consider an MDP with two states and two actions (see the toy-experiment appendix for details). The 4 black crosses correspond to the value functions of the 4 possible policies. When combining policy iteration with a low-rank estimation procedure, we only need to control the condition number of the 4 corresponding value matrices. The red dots correspond to the successive estimates $V^{(t)}$ of $V^\star$ when running value iteration. When applying a value iteration approach, we would need to upper bound the condition number of all the corresponding matrices $Q^{(t)}={\cal F}(V^{(t-1)})$ for $t\ge 1$. For a given $V$, the background color in the figure indicates the value of the condition number of ${\cal F}(V)$. We see that the dynamics of $V^{(t)}$ under the value iteration algorithm are such that the trajectory $(Q^{(t)}, t\ge 1)$ has to go through regions where the condition number is very high. Hence, in this example, a value iteration approach would not work well.
  • Figure 2: Matrix completion: the matrix $M^\star$ is of size $1000\times 1000$ and rank $d=5$, and sampled entries have additive Gaussian noise with $\sigma = 0.01$. The number of anchors used is $K = 10$. All plots are averaged over $30$ simulations, and a new random matrix $M^\star$ is generated every $5$ simulations.
  • Figure 3: The matrix $Q^\star$ is obtained from rank $d=5$ reward and transition matrices, with $S=70$, $A=50$, $\gamma = 0.9$, and $K = 15$ anchors. Observations have additive Gaussian noise with $\sigma = 0.01$. Plots are averaged over $100$ simulations, new MDPs are generated every $5$ simulations, and the number of samples in iteration $t$ is $10(1.1)^t$.
  • Figure 4: LoRa-VI: $Q^\star$ generated from low-rank $r$ and $P$ of rank $d=4$, $S=A=1000$, $\gamma = 0.1$. We used $K = 10$ anchors, $V^{(0)}=0$, and noisy rewards with Gaussian noise $\sigma = 0.01$. All plots are averaged over $5$ simulations, each consisting of $50$ epochs, and the number of samples in an epoch $t$ is approximately $20 (1.05)^t (S+A)K$.
  • Figure 5: LoRa-PI: $Q^\star$ generated from low-rank $r$ and $P$ of rank $d=4$, $S=A=1000$, $\gamma = 0.1$, $\tau = 5$. We used $K = 10$ anchors, a uniformly random initial policy, and noisy rewards with Gaussian noise $\sigma = 0.01$. Plots for PI with anchors are averaged over $3$ simulations, while the one for full-matrix PI is simulated once. Each simulation consisted of $20$ epochs, and the number of samples in an epoch $t$ is approximately $10 (1.15)^t (S+A)K$.
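To make the per-epoch sample budgets in Figures 4 and 5 concrete, the following small sketch evaluates the stated schedule $c\,g^t (S+A)K$. The function names are hypothetical, and the captions do not say whether epochs are indexed from $0$ or $1$, so the epoch index is left as an explicit argument.

```python
def anchored_entries(S, A, K):
    # Entries in K anchor rows plus K anchor columns of an S x A matrix
    # (the K x K intersection is counted twice, matching the "(S+A)K" form).
    return (S + A) * K

def epoch_samples(t, S, A, K, base, growth):
    # Approximate number of samples in epoch t under the captions' schedule.
    return base * growth ** t * anchored_entries(S, A, K)

# Setting of Figure 5: S = A = 1000, K = 10, schedule 10 * 1.15^t * (S+A)K.
S, A, K = 1000, 1000, 10
fraction_of_matrix = anchored_entries(S, A, K) / (S * A)
```

With these values, the anchor rows and columns cover $(S+A)K = 20{,}000$ entries, i.e. 2% of the $SA = 10^6$ entries of the full matrix.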

Theorems & Definitions (32)

  • Definition 1: Rank of a policy, rank of the MDP
  • Definition 2: Leverage scores
  • Lemma 1
  • Proposition 1
  • Theorem 1: Leverage Scores Estimation
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Corollary 1
  • proof
  • ...and 22 more
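Since the guarantees are stated in terms of spikiness rather than coherence, a small numerical illustration of how the two quantities can differ may help. The paper's Definition 2 is not reproduced in this summary, so the sketch below assumes the standard definitions: leverage scores are squared row norms of the top-$d$ singular vectors, coherence is the largest leverage score rescaled by (dimension)/$d$, and spikiness is $\sqrt{mn}\,\Vert M\Vert_{\max} / \Vert M\Vert_F$. The function names are illustrative.

```python
import numpy as np

def leverage_scores(M, rank):
    # Squared row norms of the top-`rank` left/right singular vectors.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    row = np.sum(U[:, :rank] ** 2, axis=1)
    col = np.sum(Vt[:rank, :] ** 2, axis=0)
    return row, col

def coherence(M, rank):
    # Largest leverage score, rescaled so that the minimum value is 1.
    m, n = M.shape
    row, col = leverage_scores(M, rank)
    return max(m * row.max(), n * col.max()) / rank

def spikiness(M):
    # Ratio of the largest entry to the root-mean-square entry.
    m, n = M.shape
    return np.sqrt(m * n) * np.abs(M).max() / np.linalg.norm(M)

# Rank-2 example: an all-ones matrix with one perturbed diagonal entry.
n = 100
M = np.ones((n, n))
M[0, 0] += 1.0
mu = coherence(M, rank=2)
alpha = spikiness(M)
```

In this example the first row and column have leverage score 1, so the coherence is close to its maximum $n/d = 50$, while the spikiness stays near 2 because no entry is much larger than the typical one.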