Improving Generalization on the ProcGen Benchmark with Simple Architectural Changes and Scale

Andrew Jesson, Yiding Jiang

TL;DR

We show that recent advances in reinforcement learning (RL), combined with simple architectural changes, significantly improve generalization on the ProcGen benchmark. The results suggest that further exploration in this direction could yield substantial improvements in addressing generalization challenges in deep reinforcement learning.

Abstract

We demonstrate that recent advances in reinforcement learning (RL), combined with simple architectural changes, significantly improve generalization on the ProcGen benchmark. These changes are frame stacking, replacing 2D convolutional layers with 3D convolutional layers, and scaling up the number of convolutional kernels per layer. Experimental results using a single set of hyperparameters across all environments show a 37.9% reduction in the optimality gap compared to the baseline (from 0.58 to 0.36). This performance matches or exceeds current state-of-the-art methods. The proposed changes are largely orthogonal, and therefore complementary, to existing approaches for improving generalization in RL. Our results suggest that further exploration in this direction could yield substantial improvements in addressing generalization challenges in deep reinforcement learning.
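As a rough illustration of the first two changes, the sketch below stacks recent frames and compares how a 2D and a 3D convolution would see them. It is not the authors' implementation: the `FrameStack` class and the helper functions are hypothetical, the 64x64 RGB observation size is assumed to be ProcGen's standard one, and the kernel size and kernel count are arbitrary choices made for the example.

```python
from collections import deque

# Illustrative sizes. The frame count of 8 follows the VSOP-3D setting;
# the kernel size and kernel count are hypothetical, not from the paper.
FRAMES, CHANNELS, HEIGHT, WIDTH = 8, 3, 64, 64
KERNEL, OUT_CHANNELS = 3, 16


class FrameStack:
    """Hypothetical helper that keeps the most recent `num_frames` observations."""

    def __init__(self, num_frames):
        self.frames = deque(maxlen=num_frames)

    def reset(self, obs):
        # Fill the buffer with copies of the first observation of the episode.
        for _ in range(self.frames.maxlen):
            self.frames.append(obs)
        return list(self.frames)

    def step(self, obs):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(obs)
        return list(self.frames)


def conv2d_input_shape(frames, channels, height, width):
    # 2D convolution: stacked frames are folded into the channel axis.
    return (frames * channels, height, width)


def conv3d_input_shape(frames, channels, height, width):
    # 3D convolution: time is kept as a separate axis the kernel slides over.
    return (channels, frames, height, width)


def conv2d_weights(in_channels, out_channels, k):
    # Weight count of one 2D layer (bias ignored).
    return out_channels * in_channels * k * k


def conv3d_weights(in_channels, out_channels, k_time, k):
    # Weight count of one 3D layer (bias ignored).
    return out_channels * in_channels * k_time * k * k


def temporal_positions(frames, k_time):
    # Number of temporal positions of a 3D kernel with stride 1, no padding.
    return frames - k_time + 1


print(conv2d_input_shape(FRAMES, CHANNELS, HEIGHT, WIDTH))
print(conv3d_input_shape(FRAMES, CHANNELS, HEIGHT, WIDTH))
print(conv2d_weights(FRAMES * CHANNELS, OUT_CHANNELS, KERNEL))
print(conv3d_weights(CHANNELS, OUT_CHANNELS, KERNEL, KERNEL))
print(temporal_positions(FRAMES, KERNEL))
```

Under these assumed sizes, the first 2D layer mixes all 8 frames in a single step, whereas the 3D kernel covers only 3 consecutive frames at a time and is applied at each temporal position. The third change, scaling, would correspond to raising `OUT_CHANNELS` in each layer.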

Paper Structure

This paper contains 7 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Illustration of 3D convolution vs 2D convolution. The 3D convolution is able to process the context at a finer temporal granularity, whereas the 2D convolution must process all the stacked temporal information at the same time. The blue tensor is the input, the orange tensor is the convolutional kernel, and the green tensor is the resulting output. A 3D convolution can also have multiple output channels, but these are not shown because a 4D tensor is hard to visualize. This visualization was created by user ashenoy at Stack Exchange (https://ai.stackexchange.com/questions/13692/when-should-i-use-3d-convolutions) under a CC BY-SA license.
  • Figure 2: Efficacy. Aggregate test level evaluation metrics for average episodic return over the last 100 steps. Compared to VSOP, VSOP-3D is given 8 frames instead of 1 frame and uses 3D instead of 2D convolutions. VSOP-3D+ scales VSOP-3D by increasing the number of frames to 16 and doubling the number of convolutional channels in each layer. We observe considerable improvements in all evaluation metrics considered. Comparing VSOP-3D+ to baseline VSOP, we observe a 65.9% increase in the Median (from 0.44 to 0.75), a 62.8% increase in the IQM (from 0.43 to 0.70), a 52.5% increase in the Mean (from 0.42 to 0.64), and a 37.9% decrease in the Optimality Gap (from 0.58 to 0.36).
  • Figure 3: Test episodic return curves aggregated over 5 random seeds for each ProcGen environment. VSOP-3D+ is shown in orange, VSOP-3D in green, VSOP in yellow, PPO-3D in purple, and PPO in blue. In most environments, VSOP-3D outperforms the comparisons and VSOP-3D+ improves upon VSOP-3D significantly. The only exceptions are Jumper and Caveflyer, where the methods perform on par with base VSOP, and Plunder, where the proposed methods underperform base VSOP.
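
The aggregate metrics named in Figure 2 can be sketched as follows. The page does not spell out their definitions, so this assumes the commonly used ones for normalized scores: the IQM is the mean of the middle 50% of scores, and the optimality gap is the mean shortfall below a target score of 1.0. The function names and the toy scores are hypothetical and are not results from the paper; only the two endpoint pairs in the final lines are taken from the Figure 2 caption.

```python
from statistics import mean, median


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the scores."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # number of scores dropped at each end
    return mean(ordered[cut:len(ordered) - cut])


def optimality_gap(scores, target=1.0):
    """Mean shortfall below `target`; scores above the target contribute zero."""
    return mean(max(target - s, 0.0) for s in scores)


def relative_change(old, new):
    """Signed percentage change from `old` to `new`."""
    return 100.0 * (new - old) / old


# Made-up normalized scores, for illustration only.
toy_scores = [0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.9, 1.2]

print(round(median(toy_scores), 3))
print(round(iqm(toy_scores), 3))
print(round(mean(toy_scores), 3))
print(round(optimality_gap(toy_scores), 3))

# Endpoint pairs quoted in the Figure 2 caption.
print(round(relative_change(0.58, 0.36), 1))  # optimality gap
print(round(relative_change(0.43, 0.70), 1))  # IQM
```

Unlike the mean, the optimality gap ignores how far a score exceeds the target, so it is lower-is-better and cannot be reduced by a few very high scores. This is why the caption reports a decrease for this metric and increases for the other three.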