Mode-Dependent Rectification for Stable PPO Training

Mohamad Mohamad, Francesco Ponzio, Xavier Descombes

TL;DR

Mode-Dependent Rectification tackles PPO instability caused by mode-dependent layers like Batch Normalization, which induce policy mismatch between training and evaluation and lead to distribution shifts and reward collapse. The authors formulate the instability in terms of a perturbation to the PPO objective’s trust region, and propose MDR, a two-phase training procedure that interleaves standard updates with a deterministic rectification phase to realign training and evaluation behavior without changing architecture. They validate MDR across Procgen game environments and real-world patch-localization tasks, showing improved stability and performance for BatchNorm and applicability to dropout, with entropy regularization helping during rectification. The findings suggest that MDR is a robust, architecture-agnostic method to stabilize on-policy RL in the presence of mode-dependent components, potentially enabling broader use of normalization and regularization techniques in PPO.

Abstract

Mode-dependent architectural components (layers that behave differently during training and evaluation, such as Batch Normalization or dropout) are commonly used in visual reinforcement learning but can destabilize on-policy optimization. We show that in Proximal Policy Optimization (PPO), discrepancies between training and evaluation behavior induced by Batch Normalization lead to policy mismatch, distributional drift, and reward collapse. We propose Mode-Dependent Rectification (MDR), a lightweight dual-phase training procedure that stabilizes PPO under mode-dependent layers without architectural changes. Experiments across procedurally generated games and real-world patch-localization tasks demonstrate that MDR consistently improves stability and performance, and extends naturally to other mode-dependent layers.
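To make the train/eval discrepancy concrete, the following is a minimal pure-Python sketch and not the authors' code. It normalizes one scalar feature with batch statistics (training mode) and with running statistics (evaluation mode), then feeds both into the same two-action softmax head. The feature values, the running statistics, and the helper names `batchnorm`, `softmax`, and `policy` are all hypothetical and chosen only for illustration.

```python
import math

def batchnorm(xs, mean, var, eps=1e-5):
    """Normalize each value with the supplied mean and variance."""
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(feature_norm):
    # Hypothetical two-action head: logits are +/- the normalized feature.
    return softmax([feature_norm, -feature_norm])

# Hypothetical mini-batch of one scalar feature, one value per state.
batch = [2.0, 3.0, 4.0, 7.0]

# Training mode: statistics are computed from the current mini-batch.
batch_mean = sum(batch) / len(batch)
batch_var = sum((x - batch_mean) ** 2 for x in batch) / len(batch)
train_out = batchnorm(batch, batch_mean, batch_var)

# Evaluation mode: statistics come from running averages, which can lag
# behind the batch statistics (the values here are made up).
running_mean, running_var = 1.0, 1.0
eval_out = batchnorm(batch, running_mean, running_var)

# Same weights and same states, yet different action probabilities.
pi_train = [policy(z) for z in train_out]
pi_eval = [policy(z) for z in eval_out]
mismatch = max(abs(p[0] - q[0]) for p, q in zip(pi_train, pi_eval))
print(f"max |pi_train - pi_eval| = {mismatch:.3f}")
```

The gap printed at the end is the kind of policy mismatch the abstract refers to: the policy that collects data (evaluation mode) is not the policy whose probabilities enter the PPO update (training mode), even though the parameters are identical.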

Paper Structure (21 sections, 12 equations, 10 figures, 2 tables, 1 algorithm)

Figures (10)

  • Figure 1: Reward collapse. Training curves when BatchNorm operates in training mode during optimization and evaluation mode during data collection. Rewards initially improve rapidly; as training progresses, the mismatch $\Delta \pi_k^-$ increases. Beyond a critical point, this growing mismatch coincides with a sudden collapse in performance.
  • Figure 2: Effect of $\delta r$ and $\Delta\epsilon$ on PPO clipping. Left: Illustration of how a perturbation $\Delta\epsilon$ enlarges the effective clipping range of the PPO objective. Right: Clipping saturation as a function of the original ratio $r$. Blue points correspond to clipping under the unperturbed ratio $r$ ($\delta r = 0$), while red points show clipping under the perturbed ratio $r'$ with bounded noise $\delta r \in \{0.05, 0.10, 0.15\}$. Increasing $\delta r$ progressively expands the effective clipping boundaries.
  • Figure 3: Performance comparison across patch-localization tasks (top) and Procgen games (bottom). Top: normalized reward expressed as a percentage of the optimal policy reward for natural-image and histopathology environments with 256 and 1024 environments. Bottom: average episode return (Score) on six Procgen games (500 environments). Shaded regions denote one standard deviation across three seeds.
  • Figure 4: Dropout comparison on Procgen (500 easy levels). Solid lines denote training score, while dashed lines denote evaluation on held-out levels. Dropout is applied at 10%.
  • Figure 5: Dropout comparison on 2048 histopathology environments. In addition to the variants shown previously, we include an additional model with 20% dropout.
  • ...and 5 more figures
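
The clipping effect described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustration under a stated assumption, not the paper's derivation: the perturbed ratio is modeled as additive, $r' = r + \text{noise}$ with $|\text{noise}| \le \delta r$, so that the widening $\Delta\epsilon$ equals $\delta r$. The function names and the value $\epsilon = 0.2$ are hypothetical.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for a single sample."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def is_clipped(ratio, advantage, eps=0.2):
    """True when the clip is active, so the sample contributes no gradient."""
    if advantage >= 0:
        return ratio > 1.0 + eps
    return ratio < 1.0 - eps

def effective_clip_range(eps, delta_r):
    """Widest range of the original ratio r that can escape clipping when
    clipping is evaluated on r' = r + noise, |noise| <= delta_r.
    The additive noise model is an assumption made for this sketch."""
    return (1.0 - eps - delta_r, 1.0 + eps + delta_r)

eps = 0.2
for delta_r in (0.0, 0.05, 0.10, 0.15):
    lo, hi = effective_clip_range(eps, delta_r)
    print(f"delta_r={delta_r:.2f}: effective range [{lo:.2f}, {hi:.2f}]")

# An original ratio of 1.30 is clipped when evaluated directly, but is
# not clipped when the evaluated ratio is shifted down by 0.15.
original_ratio, advantage = 1.30, 1.0
clipped_without_noise = is_clipped(original_ratio, advantage, eps)
clipped_with_noise = is_clipped(original_ratio - 0.15, advantage, eps)
print(clipped_without_noise, clipped_with_noise)
```

Under this model, a sample whose original ratio lies outside $[1-\epsilon, 1+\epsilon]$ can still receive a gradient, which is one way to read the caption's statement that increasing $\delta r$ expands the effective clipping boundaries.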