SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for RLHF
Dipan Maity
TL;DR
SAFE addresses persistent instability in RLHF by combining three stabilization layers: value estimation with a Double Soft-Min Critic, which injects pessimism and reduces overestimation; entropy-aware predictive KL control, which uses asymmetric penalties and entropy gating to distinguish exploration from mode collapse; and reward-driven, PID-based adaptive thresholds that tune divergence constraints across training phases. On a 3B parameter model, SAFE reaches a mean reward of $0.725$ (versus $0.689$ for PPO) with substantially lower reward volatility and tighter KL dynamics, while adding minimal GPU overhead. The result is an interpretable, crash-resistant RLHF framework that keeps learning aggressive while stabilizing long-horizon optimization for production deployments. These findings suggest that multi-layer, adaptive control across value, policy, and temporal dynamics is important for robust RLHF; systematic ablations and broader-scale evaluations are still needed to validate generalization. Key formulations include the soft-min value $V_{\text{soft}}(s) = -\alpha \log\left[\frac{1}{2}\left(e^{-V_1(s)/\alpha} + e^{-V_2(s)/\alpha}\right)\right]$, the gated loss $L_{\text{EPC}} = g_t L_{\text{base}}$, and the adaptive threshold $\tau_t = (\tau_{\text{base}} + \text{PID}(r_t)) \cdot \phi_t$.
Abstract
Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. PPO performs well empirically, but it has a heuristic motivation, handles the KL-divergence constraint used in LM-RLHF in an ad hoc manner, and suffers from reward oscillations, entropy collapse, value function drift, and sudden policy divergence that require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new, purely on-policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control), a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a new multi-layer stabilization framework built on entropy-gated KL regulation and PID-controlled adaptive thresholds. Unlike standard PPO's symmetric KL penalties, SAFE distinguishes high-entropy exploration from low-entropy mode collapse and adjusts penalties dynamically based on reward velocity. Experiments on a 3B parameter model show that SAFE achieves a 5.15\% higher training-average reward than PPO (0.725 vs 0.689), negligible reward crashes, and better KL control than PPO. Our method adds minimal computational overhead and provides an interpretable, crash-resistant RLHF framework that maintains aggressive learning speed while ensuring stable long-horizon optimization suitable for production deployment. Code is available at https://github.com/ryyzn9/SAFE
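
To make the pessimism of the Double Soft-Min Critic concrete, the following is a minimal sketch of the soft-min value $V_{\text{soft}}(s)$ stated in the TL;DR, evaluated for two scalar critic outputs. It is not taken from the SAFE repository: the function name `soft_min_value`, the scalar interface, and the default `alpha` are illustrative assumptions, and the log-sum-exp shift is a standard numerical-stability device rather than something the paper specifies.

```python
import math


def soft_min_value(v1: float, v2: float, alpha: float = 1.0) -> float:
    """Soft-min of two critic estimates (hypothetical helper, not the SAFE code).

    Computes V_soft = -alpha * log(0.5 * (exp(-v1/alpha) + exp(-v2/alpha))).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    a, b = -v1 / alpha, -v2 / alpha
    # Shift by the larger exponent so exp() never overflows (log-sum-exp trick).
    m = max(a, b)
    log_mean = m + math.log(0.5 * (math.exp(a - m) + math.exp(b - m)))
    return -alpha * log_mean


if __name__ == "__main__":
    # Two critics that disagree: the soft-min lies between min and mean.
    for alpha in (0.01, 1.0, 100.0):
        print(alpha, soft_min_value(1.0, 3.0, alpha))
```

Under this formula the estimate always lies between the hard minimum and the arithmetic mean of the two critics: a small $\alpha$ pushes it toward $\min(V_1, V_2)$ (strong pessimism), while a large $\alpha$ pushes it toward the average. When both critics agree, the soft-min returns their common value.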
