Mode-Dependent Rectification for Stable PPO Training
Mohamad Mohamad, Francesco Ponzio, Xavier Descombes
TL;DR
Mode-Dependent Rectification (MDR) tackles PPO instability caused by mode-dependent layers such as Batch Normalization, which make the policy behave differently in training and evaluation modes and thereby cause distribution shift and reward collapse. The authors formulate the instability as a perturbation to the trust region of the PPO objective and propose MDR, a two-phase training procedure that interleaves standard updates with a deterministic rectification phase to realign training and evaluation behavior without changing the architecture. They validate MDR on Procgen game environments and real-world patch-localization tasks, showing improved stability and performance with BatchNorm and applicability to dropout, with entropy regularization helping during rectification. The findings suggest that MDR is a robust, architecture-agnostic way to stabilize on-policy RL in the presence of mode-dependent components, potentially enabling broader use of normalization and regularization techniques in PPO.
Abstract
Mode-dependent architectural components (layers that behave differently during training and evaluation, such as Batch Normalization or dropout) are commonly used in visual reinforcement learning but can destabilize on-policy optimization. We show that in Proximal Policy Optimization (PPO), discrepancies between training and evaluation behavior induced by Batch Normalization lead to policy mismatch, distributional drift, and reward collapse. We propose Mode-Dependent Rectification (MDR), a lightweight dual-phase training procedure that stabilizes PPO under mode-dependent layers without architectural changes. Experiments across procedurally generated games and real-world patch-localization tasks demonstrate that MDR consistently improves stability and performance, and extends naturally to other mode-dependent layers.
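To make the policy mismatch described above concrete, the following is a minimal pure-Python sketch, not the authors' code. It assumes a toy two-action policy whose single input feature passes through a batch-normalization step, and it compares the action probabilities obtained with running statistics (evaluation mode, as typically used for rollouts) against those obtained with minibatch statistics (training mode, as used during the update). The function names, the feature values, the weight `w`, and the running statistics are all hypothetical and chosen only for illustration.

```python
import math


def batch_stats(xs):
    """Mean and (biased) variance of the current minibatch."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var


def batch_norm(xs, mean, var, eps=1e-5):
    """Normalize xs with the given statistics (no learned scale or shift)."""
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


def action_prob(z, w=1.5, b=0.0):
    """Probability of action 1 for a two-action policy with logit w * z + b."""
    return 1.0 / (1.0 + math.exp(-(w * z + b)))


def ppo_ratios(features, running_mean, running_var):
    """Probability ratio pi_train(a|s) / pi_eval(a|s) with identical weights.

    Evaluation mode normalizes with running statistics, training mode with
    the statistics of the current minibatch.
    """
    z_eval = batch_norm(features, running_mean, running_var)
    z_train = batch_norm(features, *batch_stats(features))
    return [action_prob(zt) / action_prob(ze) for zt, ze in zip(z_train, z_eval)]


# Hypothetical pre-normalization features for one minibatch of states.
features = [0.2, 1.0, 3.5, -0.7]

# Assumed running statistics that lag behind the current data distribution.
ratios = ppo_ratios(features, running_mean=0.0, running_var=1.0)

# For a mode-independent network every ratio would be exactly 1 before any
# gradient step; here the ratios deviate although no weight has changed.
print(ratios)
```

In this toy setting the ratio that PPO clips is already different from 1 before the first gradient step, so part of the trust region can be used up by the mode switch alone rather than by an actual policy update. When the running statistics happen to equal the minibatch statistics, the ratios return to 1, which is the aligned situation that a rectification phase aims to restore.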
