On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations

Rajdeep Singh Hundal, Yan Xiao, Xiaochun Cao, Jin Song Dong, Manuel Rigger

TL;DR

This work reveals that deep reinforcement learning (DRL) implementations of the same algorithm are not interchangeable. Using differential testing across five PPO implementations on 56 Atari environments, the study finds substantial performance discrepancies driven largely by code-level inconsistencies, such as frames-per-episode handling and API differences. By combining SBCI with robust statistics, the authors quantify the prevalence of discrepancies (RQ1), investigate root causes (RQ2), and demonstrate that interchangeability assumptions can flip study outcomes (RQ3). They advocate for replicability tests, differential testing as a standard practice, and large, diverse environment suites to ensure reliable conclusions in DRL research. The findings have practical implications for researchers and practitioners, highlighting the need for explicit documentation of implementation details and ongoing cross-implementation validation.

Abstract

Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence where an agent uses a neural network to learn which actions to take in a given environment. DRL has recently gained traction for solving complex environments like driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous implementations of the state-of-the-art algorithms responsible for training these agents, like the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist. However, studies make the mistake of assuming that implementations of the same algorithm are consistent and thus interchangeable. In this paper, through a differential testing lens, we present the results of studying the extent of implementation inconsistencies, their effect on the implementations' performance, and their impact on the conclusions of prior studies that assumed interchangeable implementations. The outcomes of our differential tests showed significant discrepancies between the tested algorithm implementations, indicating that they are not interchangeable. In particular, out of the five PPO implementations tested on 56 games, three implementations achieved superhuman performance in 50% of their total trials, while the other two achieved superhuman performance in less than 15% of their total trials. Through a meticulous manual analysis of the implementations' source code, we determined that code-level inconsistencies primarily caused these performance discrepancies. Lastly, we replicated a study and showed that the assumption of implementation interchangeability was sufficient to flip experiment outcomes. These findings call for a shift in how implementations are used.
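The comparison described in the abstract can be illustrated with a minimal sketch. It assumes each implementation has been run for several trials per environment and that a human baseline score exists for each environment. The function names, the 0.10 tolerance, and all scores below are hypothetical and chosen only for illustration. The paper itself relies on SBCI and robust statistics, not on the simple threshold used here.

```python
def superhuman_rate(trial_scores, human_scores):
    """Fraction of trials whose final score exceeds the human baseline.

    trial_scores: {environment: [final score of each trial]}
    human_scores: {environment: human baseline score}
    """
    outcomes = [
        score > human_scores[env]
        for env, scores in trial_scores.items()
        for score in scores
    ]
    return sum(outcomes) / len(outcomes)


def differential_test(results, human_scores, tolerance=0.10):
    """Flag implementation pairs whose superhuman rates differ by more than
    `tolerance` (a hypothetical threshold, not a statistical test)."""
    rates = {
        impl: superhuman_rate(scores, human_scores)
        for impl, scores in results.items()
    }
    names = sorted(rates)
    flagged = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(rates[a] - rates[b]) > tolerance
    ]
    return rates, flagged


# Made-up scores for two hypothetical implementations on two environments.
human = {"EnvA": 100.0, "EnvB": 50.0}
results = {
    "impl_x": {"EnvA": [120.0, 90.0], "EnvB": [60.0, 55.0]},
    "impl_y": {"EnvA": [80.0, 95.0], "EnvB": [40.0, 52.0]},
}
rates, flagged = differential_test(results, human)
print(rates)    # per-implementation superhuman rate
print(flagged)  # pairs that disagree beyond the tolerance
```

If the implementations were truly interchangeable, no pair would be flagged when both are given the same environments and training budget. A flagged pair is only a prompt to inspect the source code for inconsistencies, not proof that one exists.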

Paper Structure

This paper contains 53 sections, 2 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Applying SBCI to all trials from a configuration.
  • Figure 1: Training curves from five PPO implementations across all 56 environments where the y-axis and x-axis represent the mean reward and number of frames respectively. Five agents were trained for each (implementation, environment) permutation and the training curves were aggregated to display the mean, minimum, and maximum within the shaded regions.
  • Figure 2: Training curves from five DQN implementations where the y-axis represents the mean in-game reward while the x-axis represents the number of in-game frames that have passed. Five agents were trained for each (implementation, game) permutation and the training curves were aggregated to display the mean, minimum, and maximum within the shaded regions. The mean reward attained by a professional human tester is also shown in black to gauge superhuman or subhuman capabilities.
  • Figure 2: Training curves from the high-performing PPO implementations across the nine environments they significantly differed in (after fixing the frames per episode inconsistency). The curves from ChopperCommand, Gopher, Robotank, and Zaxxon suggest that multiple undiscovered inconsistencies still exist between the implementations.
  • Figure 3: Frame stacking approaches taken by Stable Baselines3 (top-left) and Dopamine (bottom-left), as well as a comparison of their implementation of the $\epsilon$-greedy policy (right).
  • ...and 7 more figures