Matrix Low-Rank Trust Region Policy Optimization

Sergio Rozada, Antonio G. Marques

TL;DR

This work tackles the efficiency bottleneck of policy-based RL by replacing neural-network policies with low-rank, matrix-factorized representations for both the policy and value function within the TRPO framework. By modeling Gaussian policy parameters as products of low-rank matrices, the method reduces parameter counts while preserving performance, and maintains the monotonic improvement guarantee through the TRPO KL constraint. Empirical results on continuous-control tasks show faster convergence and substantial parameter savings with competitive returns compared to NN-based TRPO. The approach suggests that structured, low-rank representations can enhance the scalability of policy-based RL, with potential extensions to higher-dimensional tasks using tensor methods.

Abstract

Most methods in reinforcement learning use a Policy Gradient (PG) approach to learn a parametric stochastic policy that maps states to actions. The standard approach is to implement such a mapping via a neural network (NN) whose parameters are optimized using stochastic gradient descent. However, PG methods are prone to large policy updates that can render learning inefficient. Trust region algorithms, like Trust Region Policy Optimization (TRPO), constrain the policy update step, ensuring monotonic improvements. This paper introduces low-rank matrix-based models as an efficient alternative for estimating the parameters of TRPO algorithms. By gathering the stochastic policy's parameters into a matrix and applying matrix-completion techniques, we promote and enforce low rank. Our numerical studies demonstrate that low-rank matrix-based policy models effectively reduce both computational and sample complexities compared to NN models, while maintaining comparable aggregated rewards.
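
The abstract's idea of gathering the policy parameters into a low-rank matrix can be illustrated with a minimal sketch. It assumes a two-dimensional state space discretized into an `n_rows x n_cols` grid, a scalar action, and a Gaussian policy whose mean for grid cell `(i, j)` is entry `(i, j)` of a rank-`rank` product `left @ right.T`. The grid sizes, the rank, the fixed `log_std`, and all function names are illustrative assumptions, not details from the paper, and the TRPO update itself is not shown.

```python
import numpy as np


def init_low_rank(n_rows, n_cols, rank, rng, scale=0.1):
    """Create the two factors whose product holds one policy mean per grid cell."""
    left = scale * rng.standard_normal((n_rows, rank))
    right = scale * rng.standard_normal((n_cols, rank))
    return left, right


def policy_mean(left, right, i, j):
    """Mean of the Gaussian policy at discretized state (i, j).

    Only row i of `left` and row j of `right` are touched, so the full
    n_rows x n_cols matrix never needs to be formed.
    """
    return float(left[i] @ right[j])


def sample_action(left, right, i, j, log_std, rng):
    """Draw a scalar action from N(mean(i, j), exp(log_std)^2)."""
    mean = policy_mean(left, right, i, j)
    return mean + np.exp(log_std) * rng.standard_normal()


def param_counts(n_rows, n_cols, rank):
    """Parameters of the factorized mean versus a full table of means."""
    return (n_rows + n_cols) * rank, n_rows * n_cols


# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
left, right = init_low_rank(n_rows=20, n_cols=30, rank=3, rng=rng)
full_mean = left @ right.T  # formed here only to inspect the model
low_rank_params, full_params = param_counts(20, 30, 3)
action = sample_action(left, right, i=4, j=7, log_std=-1.0, rng=rng)
```

With these assumed sizes the factorized mean uses (20 + 30) * 3 = 150 parameters instead of the 20 * 30 = 600 entries of a full table, which is the kind of saving the abstract refers to. In a trust-region method, the factors `left` and `right` would take the place of the NN weights as the variables being updated under the KL constraint.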

Paper Structure

This paper contains 5 sections, 11 equations, 1 figure, and 1 algorithm.

Figures (1)

  • Figure 1: Median return per episode in three standard RL problems: (a) the pendulum, (b) the acrobot, and (c) the mountain car. The number of parameters of each model is shown in the legend. TRLRPO reaches the steady state faster than NN-TRPO in the pendulum and mountain car problems, and achieves a better return in the acrobot problem.