Reduce Computational Cost In Deep Reinforcement Learning Via Randomized Policy Learning

Zhuochen Liu, Rahul Jain, Quan Nguyen

TL;DR

The paper tackles the high computational cost of deep reinforcement learning in continuous MDPs by proposing RANDPOL, an actor-critic algorithm that uses randomized neural networks and trains only the last-layer weights. It represents both the value function and the policy with random feature expansions and optimizes them via policy gradient with advantage baselines, achieving performance comparable to PPO with reduced wall-clock time. The method is validated on four OpenAI Gym benchmarks and on a 12-motor Unitree Go1 quadruped in simulation, including domain randomization and parallelized training, and shows faster convergence in wall-clock time and robust learning. This work demonstrates that randomized function approximation can substantially decrease training time in deep RL while maintaining performance, enabling more practical real-time control applications.
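To make the idea of training only the last-layer weights concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the class name `RandomFeatureModel`, the cosine random features, the fixed-variance Gaussian policy, and the plain gradient steps are all assumptions chosen for illustration. The hidden layer is sampled once and frozen, so each update touches only a linear weight matrix.

```python
import numpy as np


class RandomFeatureModel:
    """Linear model on top of a fixed, randomly sampled hidden layer (illustrative)."""

    def __init__(self, in_dim, n_features, out_dim, rng):
        # Hidden-layer parameters are sampled once and never updated.
        self.W = rng.normal(size=(in_dim, n_features))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        # Only these last-layer weights are trained.
        self.theta = np.zeros((n_features, out_dim))

    def features(self, x):
        # Assumed feature map: scaled cosine random features.
        return np.cos(x @ self.W + self.b) * np.sqrt(2.0 / self.W.shape[1])

    def __call__(self, x):
        return self.features(x) @ self.theta


def critic_step(critic, states, returns, lr):
    """One gradient-descent step on the squared error between V(s) and returns."""
    phi = critic.features(states)                # shape (N, K)
    err = phi @ critic.theta - returns[:, None]  # shape (N, 1)
    critic.theta -= lr * phi.T @ err / len(states)


def actor_step(actor, states, actions, advantages, sigma, lr):
    """One policy-gradient ascent step for a Gaussian policy with fixed std sigma."""
    phi = actor.features(states)                 # shape (N, K)
    mean = phi @ actor.theta                     # shape (N, A)
    # Gradient of log N(a | mean, sigma^2) with respect to the mean.
    score = (actions - mean) / sigma**2          # shape (N, A)
    grad = phi.T @ (score * advantages[:, None]) / len(states)
    actor.theta += lr * grad
```

Because the features are fixed, both updates reduce to a couple of matrix products, with no backpropagation through hidden layers. That is one plausible reason each update could be cheaper in wall-clock time than a step on a fully trained deep network, at the possible cost of needing more timesteps.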

Abstract

Recent advancements in reinforcement learning (RL) have leveraged neural networks to achieve state-of-the-art performance across various control tasks. However, these successes often come at the cost of significant computational resources, as training deep neural networks requires substantial time and data. In this paper, we introduce an actor-critic algorithm that utilizes randomized neural networks to drastically reduce computational costs while maintaining strong performance. Despite its simple architecture, our method effectively solves a range of control problems, including the locomotion control of a highly dynamic 12-motor quadruped robot, and achieves results comparable to leading algorithms such as Proximal Policy Optimization (PPO). Notably, our approach does not outperform other algorithms in terms of sample efficiency but rather in terms of wall-clock training time. That is, although our algorithm requires more timesteps to converge to an optimal policy, the actual time required for training turns out to be lower.

Paper Structure

This paper contains 7 sections, 8 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Performance comparison in training time of each algorithm.
  • Figure 2: Performance comparison in the number of timesteps of each algorithm.
  • Figure 3: Comparison of algorithm performance in training a quadruped robot controller. RANDPOL achieves the highest cumulative reward and collects the most simulated samples per unit time without losing much sample efficiency.