Experience-Efficient Model-Free Deep Reinforcement Learning Using Pre-Training

Ruoxing Yang

TL;DR

The paper addresses sample- and compute-efficient reinforcement learning for physics-based simulations by introducing PPOPT, a model-free PPO variant whose policy network embeds a pretrained middle section. This section is pretrained in a related environment and transplanted into the PPOPT policy to carry over physics knowledge, improving learning speed and stability when experience is scarce. Empirical results show PPOPT outperforming baseline PPO on both the Double Inverted Pendulum and Hopper tasks in learning stability and final performance. PPOPT also trains significantly faster than DYNA DDPG, although it underperforms that model-based method. The work provides a practical, open-source approach to experience-efficient RL and outlines avenues for combining PPOPT with model-based ideas to better balance performance and training time.

Abstract

We introduce PPOPT (Proximal Policy Optimization using Pretraining), a novel, model-free deep-reinforcement-learning algorithm that leverages pretraining to achieve high training efficiency and stability on very small training samples in physics-based environments. Reinforcement learning agents typically rely on large samples of environment interactions to learn a policy. However, frequent interactions with a (computer-simulated) environment may incur high computational costs, especially when the environment is complex. Our main innovation is a new policy neural network architecture that consists of a pretrained neural network middle section sandwiched between two fully connected networks. Pretraining part of the network on a different environment with similar physics helps the agent learn the target environment efficiently, because the agent can leverage a general understanding of the transferable physics characteristics of the pretraining environment. We demonstrate that PPOPT outperforms baseline classic PPO on small training samples, both in terms of rewards gained and general training stability. While PPOPT underperforms classic model-based methods such as DYNA DDPG, the model-free nature of PPOPT allows it to train in significantly less time than its model-based counterparts. Finally, we present our implementation of PPOPT as open-source software, available at github.com/Davidrxyang/PPOPT.
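
The sandwich architecture described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the class `SandwichPolicy`, the helper `transplant_middle`, the layer sizes, the tanh activation, and the observation/action dimensions are all assumptions chosen for illustration. Whether the transplanted middle section is frozen or fine-tuned afterwards is not stated in this excerpt, so the sketch shows only the weight transplant and the forward pass.

```python
import copy
import math
import random


def make_layer(n_in, n_out, rng):
    """Randomly initialised dense layer, returned as (weights, biases)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x, activation=math.tanh):
    """Apply one dense layer to the input vector x."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


class SandwichPolicy:
    """Input block -> middle section -> output block (hypothetical layout)."""

    def __init__(self, obs_dim, act_dim, hidden_dim, n_middle, rng):
        # Environment-specific blocks: their sizes depend on the task.
        self.input_block = make_layer(obs_dim, hidden_dim, rng)
        self.output_block = make_layer(hidden_dim, act_dim, rng)
        # Middle section: hidden_dim -> hidden_dim, so it fits any task.
        self.middle = [make_layer(hidden_dim, hidden_dim, rng)
                       for _ in range(n_middle)]

    def forward(self, obs):
        h = dense(self.input_block, obs)
        for layer in self.middle:
            h = dense(layer, h)
        # Linear output, read here as the mean of the action distribution.
        return dense(self.output_block, h, activation=None)


def transplant_middle(source, target):
    """Copy the middle section of `source` into `target` without aliasing."""
    target.middle = copy.deepcopy(source.middle)


rng = random.Random(0)
# Illustrative dimensions only, not those of the real environments.
source = SandwichPolicy(obs_dim=4, act_dim=1, hidden_dim=8, n_middle=2, rng=rng)
# ... PPO pretraining of `source` on the related environment would go here ...
target = SandwichPolicy(obs_dim=6, act_dim=3, hidden_dim=8, n_middle=2, rng=rng)
transplant_middle(source, target)
action_means = target.forward([0.0] * 6)
```

Because the middle section maps a fixed-width hidden vector to another of the same width, it can be moved between policies whose observation and action spaces differ; only the two outer blocks need to match the target environment.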

Paper Structure

This paper contains 25 sections, 1 equation, 5 figures, 2 algorithms.

Figures (5)

  • Figure 1: PPOPT network architecture
  • Figure 2: Results from Double Inverted Pendulum, excluding DYNA DDPG
  • Figure 3: Results from Double Inverted Pendulum
  • Figure 4: Results from Hopper, excluding DYNA DDPG
  • Figure 5: Results from Hopper