
SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for RLHF

Dipan Maity

TL;DR

SAFE addresses persistent instability in RLHF by combining three stabilization layers: value estimation with a Double Soft-Min Critic, which injects pessimism and reduces overestimation; entropy-aware predictive KL control, which separates exploration from mode collapse using asymmetric penalties and entropy gating; and reward-driven, PID-based adaptive thresholds, which tune divergence constraints across training phases. On a 3B parameter model, SAFE raises mean reward to $0.725$ with substantially lower reward volatility and tighter KL dynamics than PPO, while adding minimal GPU overhead. The result is an interpretable, crash-resistant RLHF framework that keeps learning fast while stabilizing long-horizon optimization for production deployments. These findings suggest that multi-layer, adaptive control across value, policy, and temporal dynamics is essential for robust RLHF, and they point toward further systematic ablations and broader-scale evaluations to validate generalization. The key formulations behind stable on-policy RLHF include the soft-min value $V_{\text{soft}}(s) = -\alpha \log\left[\frac{1}{2}\left(e^{-V_1(s)/\alpha} + e^{-V_2(s)/\alpha}\right)\right]$, the gated loss $L_{\text{EPC}} = g_t L_{\text{base}}$, and the adaptive threshold $\tau_t = (\tau_{\text{base}} + \text{PID}(r_t)) \cdot \phi_t$.
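To make the soft-min formula concrete, here is a minimal sketch that evaluates $V_{\text{soft}}(s)$ for two scalar critic outputs. The function name `soft_min_value` and the scalar interface are illustrative assumptions, not taken from the paper or its repository; the log-sum-exp shift is a standard numerical-stability trick rather than something the source specifies.

```python
import math


def soft_min_value(v1: float, v2: float, alpha: float) -> float:
    """Evaluate V_soft = -alpha * log(0.5 * (exp(-v1/alpha) + exp(-v2/alpha))).

    Hypothetical helper for illustration only; v1 and v2 stand in for the
    two critic estimates V_1(s) and V_2(s), alpha is the temperature.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    a, b = -v1 / alpha, -v2 / alpha
    # Shift by the larger exponent so exp() never overflows.
    m = max(a, b)
    log_mean_exp = m + math.log(0.5 * (math.exp(a - m) + math.exp(b - m)))
    return -alpha * log_mean_exp


if __name__ == "__main__":
    for alpha in (0.01, 0.5, 100.0):
        print(alpha, soft_min_value(1.0, 2.0, alpha))
```

The result always lies between $\min(V_1, V_2)$ and the mean of the two estimates: a small $\alpha$ pushes it toward the hard minimum, while a large $\alpha$ pushes it toward the average, so $\alpha$ sets how much pessimism is applied.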

Abstract

Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. PPO performs well empirically, but it has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner. It also suffers from reward oscillations, entropy collapse, value function drift, and sudden policy divergence, which require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new, purely on-policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control), a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a new multi-layer stabilization framework built on entropy-gated KL regulation and PID-controlled adaptive thresholds. Unlike standard PPO's symmetric KL penalties, SAFE distinguishes high-entropy exploration from low-entropy mode collapse and adjusts penalties dynamically based on reward velocity. Experiments on a 3B parameter model show that SAFE achieves a 5.15% higher training-average reward than PPO (0.725 vs 0.689), negligible reward crashes, and tighter KL control than PPO. Our method adds minimal computational overhead and provides an interpretable, crash-resistant RLHF framework that maintains aggressive learning speed while ensuring stable long-horizon optimization suitable for production deployment. Code is available at https://github.com/ryyzn9/SAFE
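As a rough illustration of how a PID-controlled adaptive threshold of the form $\tau_t = (\tau_{\text{base}} + \text{PID}(r_t)) \cdot \phi_t$ could be computed, the sketch below uses a textbook discrete PID update. Everything beyond that formula is an assumption made for illustration: the class name `PIDThreshold`, the error signal (a hypothetical target reward minus the observed reward), the gain values, the clamp at zero, and the treatment of $\phi_t$ as a caller-supplied scale factor. The paper's actual controller may differ on each of these points.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDThreshold:
    """Hypothetical PID-adjusted KL threshold (illustrative, not the paper's code)."""

    tau_base: float = 0.05       # assumed base KL threshold
    kp: float = 0.1              # assumed proportional gain
    ki: float = 0.0              # assumed integral gain
    kd: float = 0.0              # assumed derivative gain
    target_reward: float = 0.7   # assumed reward set point
    integral: float = 0.0
    prev_error: Optional[float] = None

    def step(self, reward: float, phi: float = 1.0) -> float:
        """Return tau_t = (tau_base + PID(r_t)) * phi_t for one training step."""
        # Assumed error signal: how far the reward is below the set point.
        error = self.target_reward - reward
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        pid = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp at zero so the threshold is never negative (an added assumption).
        return max(0.0, (self.tau_base + pid) * phi)


if __name__ == "__main__":
    controller = PIDThreshold()
    for r in (0.70, 0.50, 0.65):
        print(r, controller.step(r))
```

Under these assumptions, a reward below the set point widens the threshold and a reward above it tightens the threshold; whether SAFE uses this sign convention is not stated in the text above.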

Paper Structure (84 sections, 37 equations, 4 figures, 4 tables, 4 algorithms)

Figures (4)

  • Figure 1: Training dynamics comparing asymmetric KL control (blue) versus PPO (red) over 2000 steps. Asymmetric control achieves stronger KL regulation but exhibits higher value loss volatility, indicating the need for additional stabilization (Section \ref{sec:Entropy-Aware}).
  • Figure 2: Training dynamics of SAFE. Top row: reward trajectory, KL divergence with adaptive threshold, and value loss. Bottom row: policy entropy with entropy floor, completion length, and smoothed reward. The controller maintains entropy above the configured floor while dynamically regulating KL magnitude.
  • Figure 3: Training dynamics and stability comparison between SAFE and PPO. Top row: Reward trajectory with confidence intervals, KL divergence with adaptive thresholds, and value function loss. Middle row: Reward distribution, KL distribution, and completion length evolution. Bottom row: Reward--KL trade-off scatter, cumulative reward accumulation, rolling reward stability, reward box plots, KL box plots, and batch-level reward variance. SAFE exhibits tighter reward confidence bands, improved reward stability, controlled KL excursions, and smoother cumulative reward growth compared to PPO, indicating improved robustness under long-horizon RLHF optimization.
  • Figure 4: GPU memory and runtime comparison between SAFE and PPO. Left: GPU peak memory usage per training step over 2,000 iterations. Both methods exhibit similar memory profiles with comparable peak allocations ($\sim$54 GB). Center: Wall-clock time per training step. SAFE maintains slightly faster step times with reduced variance, indicating that the additional control logic does not introduce computational bottlenecks. Right: Summary comparison of initialization memory, peak memory, and average step time. SAFE achieves near-identical resource usage, with a memory overhead of $-0.9\%$ and a time overhead of $-1.4\%$ relative to PPO (that is, slightly lower on both), demonstrating that the multi-layer stabilization framework is computationally efficient.