SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for RLHF

Dipan Maity

TL;DR

SAFE addresses persistent instability in RLHF by combining three stabilization layers: (1) value estimation with a Double Soft-Min Critic, which injects pessimism and reduces overestimation; (2) entropy-aware predictive KL control, which separates exploration from mode collapse using asymmetric penalties and entropy gating; and (3) reward-driven, PID-based adaptive thresholds that tune divergence constraints across training phases. Empirical results on a 3B-parameter model show that SAFE increases mean reward to $0.725$ with substantially reduced reward volatility and tighter KL dynamics than PPO, while adding minimal GPU overhead. The approach provides an interpretable, crash-resistant RLHF framework that supports aggressive learning speed while stabilizing long-horizon optimization for production deployments. These findings suggest that multi-layer, adaptive control across value, policy, and temporal dynamics is essential for robust RLHF, and they point toward further systematic ablations and broader-scale evaluations to validate generalization. Among the key formulations enabling stable on-policy RLHF are $V_{\text{soft}}(s) = -\alpha \log\left[\frac{1}{2}\left(e^{-V_1(s)/\alpha} + e^{-V_2(s)/\alpha}\right)\right]$, $L_{\text{EPC}} = g_t L_{\text{base}}$, and $\tau_t = (\tau_{\text{base}} + \text{PID}(r_t)) \cdot \phi_t$.
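To make the soft-min aggregation concrete, the following is a minimal sketch of $V_{\text{soft}}(s)$ for two scalar critic outputs. It is not taken from the SAFE repository: the function name `soft_min_value`, the scalar interface, and the default `alpha` are assumptions made for illustration. The max-shift inside the logarithm is a standard numerical-stability trick and does not change the value of the formula.

```python
import math


def soft_min_value(v1: float, v2: float, alpha: float = 1.0) -> float:
    """Soft-min of two critic estimates (hypothetical helper, not the authors' code).

    Computes V_soft = -alpha * log(0.5 * (exp(-v1/alpha) + exp(-v2/alpha))).
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    a, b = -v1 / alpha, -v2 / alpha
    # Shift by the larger exponent so exp() never overflows.
    m = max(a, b)
    log_mean_exp = m + math.log(0.5 * (math.exp(a - m) + math.exp(b - m)))
    return -alpha * log_mean_exp


if __name__ == "__main__":
    for alpha in (1e-3, 1.0, 1e6):
        print(alpha, soft_min_value(1.0, 3.0, alpha))
```

The result always lies between `min(v1, v2)` and the arithmetic mean of the two estimates: a small `alpha` approaches the hard minimum (maximally pessimistic), while a large `alpha` approaches the mean. This is how a single temperature can trade off pessimism against smoothness.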

Abstract

Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. PPO performs well empirically but has a heuristic motivation, handles the KL-divergence constraint used in LM-RLHF in an ad hoc manner, and suffers from reward oscillations, entropy collapse, value function drift, and sudden policy divergence that require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new, purely on-policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-aware control), a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a new multi-layer stabilization framework built on entropy-gated KL regulation and PID-controlled adaptive thresholds. Unlike standard PPO's symmetric KL penalties, SAFE distinguishes high-entropy exploration from low-entropy mode collapse and adjusts penalties dynamically based on reward velocity. Experiments on a 3B-parameter model show that SAFE achieves a 5.15\% higher training-average reward than PPO (0.725 vs. 0.689), negligible reward crashes, and tighter KL control than PPO. Our method adds minimal computational overhead and provides an interpretable, crash-resistant RLHF framework that maintains aggressive learning speed while ensuring stable long-horizon optimization suitable for production deployment. Code is available at https://github.com/ryyzn9/SAFE
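As a rough illustration of the two control formulas $\tau_t = (\tau_{\text{base}} + \text{PID}(r_t)) \cdot \phi_t$ and $L_{\text{EPC}} = g_t L_{\text{base}}$, the sketch below shows one way such a controller could be wired up. Everything beyond those two formulas is an assumption: the class `PIDThreshold`, the function `entropy_gate`, the choice of `target_reward - reward` as the PID error, the logistic shape of the gate, the clamping bounds, and all default gains are hypothetical and are not taken from the paper or its code.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PIDThreshold:
    """Hypothetical PID-adjusted KL threshold: tau_t = (tau_base + PID(r_t)) * phi_t."""

    tau_base: float = 0.05
    kp: float = 0.1            # proportional gain (assumed value)
    ki: float = 0.01           # integral gain (assumed value)
    kd: float = 0.05           # derivative gain (assumed value)
    target_reward: float = 0.7  # assumed set point for the error signal
    tau_min: float = 0.01      # assumed safety clamp, not part of the formula
    tau_max: float = 0.2
    _integral: float = field(default=0.0, init=False, repr=False)
    _prev_error: Optional[float] = field(default=None, init=False, repr=False)

    def step(self, reward: float, phi: float = 1.0) -> float:
        # Assumed sign convention: reward below target loosens the threshold.
        error = self.target_reward - reward
        self._integral += error
        derivative = 0.0 if self._prev_error is None else error - self._prev_error
        self._prev_error = error
        pid = self.kp * error + self.ki * self._integral + self.kd * derivative
        tau = (self.tau_base + pid) * phi
        return min(max(tau, self.tau_min), self.tau_max)


def entropy_gate(entropy: float, entropy_floor: float = 1.0,
                 sharpness: float = 5.0, max_gain: float = 2.0) -> float:
    """Hypothetical gate g_t: near 1 at high entropy, near 1 + max_gain at low entropy."""
    z = sharpness * (entropy_floor - entropy)
    return 1.0 + max_gain / (1.0 + math.exp(-z))


def epc_loss(base_loss: float, entropy: float) -> float:
    """L_EPC = g_t * L_base, using the assumed gate above."""
    return entropy_gate(entropy) * base_loss
```

Under these assumptions, a call such as `PIDThreshold().step(reward=0.5)` returns a threshold above `tau_base`, and `epc_loss` scales the same base penalty more strongly when policy entropy is below the floor than when it is well above it, which mirrors the asymmetric treatment of exploration and mode collapse described in the abstract.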

Paper Structure (84 sections, 37 equations, 4 figures, 4 tables, 4 algorithms)

Introduction
Contributions.
Background
RLHF Pipeline
Log-Probability Ratio Estimation in RLHF
Double Critic Learning and Pessimistic Aggregation
Entropy Regularization and Exploration Stability
Method:
Value-Level Stabilization via Pessimistic Aggregation.
Policy-Level Stabilization via Entropy-Aware Predictive Control.
Motivation: Late-Training Entropy Collapse in RLHF
Asymmetric Controller Design
Asymmetric Divergence Penalty.
Momentum-Based Early Warning.
Combined Control Signal.
...and 69 more sections

Figures (4)

Figure 1: Training dynamics comparing asymmetric KL control (blue) versus PPO (red) over 2000 steps. Asymmetric control achieves stronger KL regulation but exhibits higher value loss volatility, indicating the need for additional stabilization (Section \ref{sec:Entropy-Aware}).
Figure 2: Training dynamics of SAFE. Top row: reward trajectory, KL divergence with adaptive threshold, and value loss. Bottom row: policy entropy with entropy floor, completion length, and smoothed reward. The controller maintains entropy above the configured floor while dynamically regulating KL magnitude.
Figure 3: Training dynamics and stability comparison between SAFE and PPO. Top row: Reward trajectory with confidence intervals, KL divergence with adaptive thresholds, and value function loss. Middle row: Reward distribution, KL distribution, and completion length evolution. Bottom row: Reward--KL trade-off scatter, cumulative reward accumulation, rolling reward stability, reward box plots, KL box plots, and batch-level reward variance. SAFE exhibits tighter reward confidence bands, improved reward stability, controlled KL excursions, and smoother cumulative reward growth compared to PPO, indicating improved robustness under long-horizon RLHF optimization.
Figure 4: GPU memory and runtime comparison between SAFE and PPO. Left: GPU peak memory usage per training step over 2,000 iterations. Both methods exhibit similar memory profiles with comparable peak allocations ($\sim$54 GB). Center: Wall-clock time per training step. SAFE maintains slightly faster step times with reduced variance, indicating that the additional control logic does not introduce computational bottlenecks. Right: Summary comparison of initialization memory, peak memory, and average step time. SAFE achieves near-identical resource usage with $-0.9\%$ memory overhead and $-1.4\%$ time overhead, demonstrating that the multi-layer stabilization framework is computationally efficient.
