Using Offline Data to Speed Up Reinforcement Learning in Procedurally Generated Environments

Alain Andres, Lukas Schäfer, Stefano V. Albrecht, Javier Del Ser

TL;DR

The paper tackles generalization and sample efficiency of reinforcement learning (RL) in procedurally generated (PCG) environments by leveraging offline demonstration data through imitation learning (IL). It systematically compares IL used for pre-training against IL run concurrently with online RL, using PPO for on-policy optimization and behavioral cloning (BC) for imitation, on the MiniGrid and Procgen benchmarks. Key findings show that even a small, diverse set of offline demonstrations can drastically reduce the number of required interactions, with diversity often outweighing demonstration optimality; concurrent IL provides robustness when offline data is limited. The work highlights practical implications for robotics and industrial automation, where interaction costs are high, and suggests future directions in diversity-aware data collection and curriculum-like IL strategies to further improve generalization in PCG tasks.

Abstract

One of the key challenges of Reinforcement Learning (RL) is the ability of agents to generalise their learned policy to unseen settings. Moreover, training RL agents requires large numbers of interactions with the environment. Motivated by the recent success of Offline RL and Imitation Learning (IL), we conduct a study to investigate whether agents can leverage offline data in the form of trajectories to improve the sample-efficiency in procedurally generated environments. We consider two settings of using IL from offline data for RL: (1) pre-training a policy before online RL training and (2) concurrently training a policy with online RL and IL from offline data. We analyse the impact of the quality (optimality of trajectories) and diversity (number of trajectories and covered level) of available offline trajectories on the effectiveness of both approaches. Across four well-known sparse reward tasks in the MiniGrid environment, we find that using IL for pre-training and concurrently during online RL training both consistently improve the sample-efficiency while converging to optimal policies. Furthermore, we show that pre-training a policy from as few as two trajectories can make the difference between learning an optimal policy at the end of online training and not learning at all. Our findings motivate the widespread adoption of IL for pre-training and concurrent IL in procedurally generated environments whenever offline trajectories are available or can be generated.
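
The two settings described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear softmax policy, synthetic demonstrations, and a plain REINFORCE-style step standing in for the PPO update named in the TL;DR. All function names (`bc_step`, `pretrain`, `reinforce_step`, `concurrent_update`) and hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bc_loss_and_grad(W, obs, actions):
    """Cross-entropy (behavioral cloning) loss of a linear softmax policy."""
    probs = softmax(obs @ W)                       # shape (N, A)
    n = obs.shape[0]
    loss = -np.log(probs[np.arange(n), actions] + 1e-12).mean()
    probs[np.arange(n), actions] -= 1.0            # gradient w.r.t. logits
    return loss, obs.T @ probs / n                 # gradient w.r.t. W, shape (D, A)

def bc_step(W, obs, actions, lr=0.5):
    _, grad = bc_loss_and_grad(W, obs, actions)
    return W - lr * grad

def pretrain(W, demos, steps=200, lr=0.5):
    """Setting (1): fit the policy to offline demonstrations before any RL."""
    obs, actions = demos
    for _ in range(steps):
        W = bc_step(W, obs, actions, lr)
    return W

def reinforce_step(W, obs, actions, returns, lr=0.1):
    """Assumed stand-in for the online RL update (the paper uses PPO)."""
    probs = softmax(obs @ W)
    onehot = np.zeros_like(probs)
    onehot[np.arange(obs.shape[0]), actions] = 1.0
    # Ascend the return-weighted log-likelihood of the taken actions.
    grad = obs.T @ ((onehot - probs) * returns[:, None]) / obs.shape[0]
    return W + lr * grad

def concurrent_update(W, online_batch, buffer, rng, bc_batch=32):
    """Setting (2): one RL step, then one IL step on a buffer seeded with demos."""
    obs, actions, returns = online_batch
    W = reinforce_step(W, obs, actions, returns)
    b_obs, b_act = buffer
    idx = rng.integers(0, len(b_act), size=min(bc_batch, len(b_act)))
    return bc_step(W, b_obs[idx], b_act[idx])

# Synthetic "expert" demonstrations: 4-dimensional observations, 3 actions.
rng = np.random.default_rng(0)
demo_obs = rng.normal(size=(64, 4))
demo_act = (demo_obs @ rng.normal(size=(4, 3))).argmax(axis=1)

W0 = np.zeros((4, 3))
W_pre = pretrain(W0, (demo_obs, demo_act))         # initialization for online RL
```

In setting (1) the pre-trained weights `W_pre` would simply replace a random initialization before online training starts; in setting (2) `concurrent_update` would be called once per online batch, with the buffer initialized from the offline demonstrations.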

Paper Structure

This paper contains 49 sections, 4 equations, 22 figures, 3 tables.

Figures (22)

  • Figure 1: Two different levels of the O1Dlhb (top) and MN12S10 (bottom) tasks from the MiniGrid benchmark. The agent only has access to the bright area highlighted in front of it. Variations in the agent's spawn position, door colors, and target locations across levels give rise to the procedurally generated content.
  • Figure 2: Different levels of Ninja (left) and Climber (right) tasks from the Procgen benchmark. Variations in game assets such as bombs, platform configurations, and background underscore the PCG elements in each task.
  • Figure 3: Training performance (blue) and testing performance (orange), given by episodic returns, as a function of the number of levels used during training across multiple PCG tasks. The solid lines represent the average return, while the shaded areas indicate the standard deviation. Testing performance is evaluated on levels that were not seen during the training phase. For the Procgen tasks Ninja and Climber (data reproduced from cobbe_leveraging_2020), shown on the left, the curves were obtained by training the agent with PPO for 200M time steps. In contrast, for the MiniGrid tasks O1Dlhb, O2Dlh, MN7S8, and MN12S10, the curves were derived from training the agent with an approach that combines IL with Intrinsic Motivation (andres_towards_2022) for 10M to 20M time steps.
  • Figure 4: Our proposed evaluation framework. On the left, we train an agent with RAPID to collect datasets of varying quality. On the top right, we use IL just to pre-train a policy which is then used as initialization for the RL training. Alternatively, on the bottom right, we concurrently train the policy with RL and IL by initializing the buffer with the offline collected demonstrations.
  • Figure 5: Performance of the agent when pre-training with IL before the RL training phase in O1Dlhb (top) and MN12S10 (bottom). The horizontal dashed lines represent the return of the pre-trained policies (trained solely with BC) over the entire distribution of levels; these policies serve as the initialization point for the RL training phase. Depending on the task and the demonstrations used, the number of pre-training optimization steps (3,000 or 10,000) affects performance to a greater or lesser extent. Note that the x-axis shows the number of agent interactions (steps) after the pre-training phase.
  • ...and 17 more figures