Matrix Low-Rank Trust Region Policy Optimization
Sergio Rozada, Antonio G. Marques
TL;DR
This work tackles the efficiency bottleneck of policy-based RL by replacing neural-network policies with low-rank, matrix-factorized representations for both the policy and value function within the TRPO framework. By modeling Gaussian policy parameters as products of low-rank matrices, the method reduces parameter counts while preserving performance, and maintains the monotonic improvement guarantee through the TRPO KL constraint. Empirical results on continuous-control tasks show faster convergence and substantial parameter savings with competitive returns compared to NN-based TRPO. The approach suggests that structured, low-rank representations can enhance the scalability of policy-based RL, with potential extensions to higher-dimensional tasks using tensor methods.
Abstract
Most methods in reinforcement learning use a Policy Gradient (PG) approach to learn a parametric stochastic policy that maps states to actions. The standard approach is to implement such a mapping via a neural network (NN) whose parameters are optimized using stochastic gradient descent. However, PG methods are prone to large policy updates that can render learning inefficient. Trust region algorithms, like Trust Region Policy Optimization (TRPO), constrain the policy update step, ensuring monotonic improvements. This paper introduces low-rank matrix-based models as an efficient alternative for estimating the parameters of the stochastic policy in TRPO algorithms. By gathering the stochastic policy's parameters into a matrix and applying matrix-completion techniques, we promote and enforce low rank. Our numerical studies demonstrate that low-rank matrix-based policy models effectively reduce both computational and sample complexities compared to NN models, while maintaining comparable aggregated rewards.
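To make the idea of gathering the policy parameters into a low-rank matrix concrete, the following is a minimal sketch and not the authors' implementation. It assumes a two-dimensional state space discretized into a grid, a scalar action, and a Gaussian policy whose means are stored as the product of two thin factors. The grid sizes `N` and `M`, the rank `K`, the shared `log_std`, and all function names are hypothetical choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: a 2-D state discretized into an N x M grid, scalar action.
N, M, K = 50, 40, 3  # grid sizes and rank are illustrative, not from the paper

# Low-rank factors: the N x M matrix of Gaussian means is L @ R.T (rank <= K).
L = 0.1 * rng.standard_normal((N, K))
R = 0.1 * rng.standard_normal((M, K))
log_std = 0.0  # assumed state-independent log standard deviation


def policy_mean(i, j):
    """Mean of the Gaussian policy at grid cell (i, j)."""
    return L[i] @ R[j]


def sample_action(i, j):
    """Draw an action from N(policy_mean(i, j), exp(log_std)^2)."""
    return policy_mean(i, j) + np.exp(log_std) * rng.standard_normal()


def gaussian_kl_same_std(mu_old, mu_new, std):
    """KL divergence between two univariate Gaussians sharing one std.

    This is the quantity a TRPO-style trust region would bound per state.
    """
    return (mu_old - mu_new) ** 2 / (2.0 * std ** 2)


# Parameter count: a full table of means versus the factorized model.
full_params = N * M
low_rank_params = (N + M) * K
print(full_params, low_rank_params)
```

Under these assumptions the factorized model stores (N + M)K values instead of NM, and an update to a single row of `L` changes the policy mean in every cell of that row, which is one intuition for why such a structure may need fewer samples.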
