SPEQ: Offline Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning

Carlo Romeo; Girolamo Macaluso; Alessandro Sestini; Andrew D. Bagdanov

TL;DR

The paper tackles the scalability challenge of high update-to-data (UTD) reinforcement learning by introducing SPEQ, an offline stabilization framework that interleaves one-to-one online updates ($\text{UTD}=1$) with periodic offline phases. During offline phases, Q-functions are fine-tuned on a fixed replay buffer with dropout regularization to curb overestimation bias, using only two critics to remain computationally efficient. Empirically, SPEQ achieves 40%–99% fewer gradient updates and 27%–78% less training time than state-of-the-art high-UTD methods on MuJoCo while maintaining or improving performance. This demonstrates that periodic stabilization phases can outperform simply lowering the UTD ratio, offering a scalable approach for real-world RL where compute is constrained.

Abstract

High update-to-data (UTD) ratio algorithms in reinforcement learning (RL) improve sample efficiency but incur high computational costs, limiting real-world scalability. We propose Offline Stabilization Phases for Efficient Q-Learning (SPEQ), an RL algorithm that combines low-UTD online training with periodic offline stabilization phases. During these phases, Q-functions are fine-tuned with high UTD ratios on a fixed replay buffer, reducing redundant updates on suboptimal data. This structured training schedule optimally balances computational and sample efficiency, addressing the limitations of both high and low UTD ratio approaches. We empirically demonstrate that SPEQ requires from 40% to 99% fewer gradient updates and 27% to 78% less training time compared to state-of-the-art high UTD ratio methods while maintaining or surpassing their performance on the MuJoCo continuous control benchmark. Our findings highlight the potential of periodic stabilization phases as an effective alternative to conventional training schedules, paving the way for more scalable reinforcement learning solutions in real-world applications where computational resources are constrained.
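The training schedule described in the abstract can be sketched as a plain control loop. This is a minimal illustration, not the authors' implementation: the function name `run_speq_schedule` and the three callbacks (`collect_transition`, `online_update`, `offline_critic_update`) are hypothetical placeholders, and details such as warm-up steps or whether the policy is also touched during stabilization are not specified here and are left out.

```python
from typing import Any, Callable, List


def run_speq_schedule(
    total_env_steps: int,
    F: int,  # environment steps between offline stabilization phases
    N: int,  # critic gradient steps per stabilization phase
    collect_transition: Callable[[], Any],              # hypothetical: one env interaction
    online_update: Callable[[List[Any]], None],         # hypothetical: one agent update
    offline_critic_update: Callable[[List[Any]], None],  # hypothetical: one Q-function update
) -> List[Any]:
    """Interleave UTD=1 online training with periodic offline phases (sketch)."""
    replay_buffer: List[Any] = []
    for step in range(1, total_env_steps + 1):
        # Online phase: one environment step, then exactly one update (UTD = 1).
        replay_buffer.append(collect_transition())
        online_update(replay_buffer)

        # Offline stabilization phase: every F steps, fine-tune the Q-functions
        # for N gradient steps while the replay buffer stays fixed.
        if step % F == 0:
            for _ in range(N):
                offline_critic_update(replay_buffer)
    return replay_buffer
```

Passing simple counting stubs for the three callbacks is enough to check that the loop issues one online update per environment step and `N` offline updates after every `F` steps.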

Paper Structure (6 sections, 7 figures, 1 table, 1 algorithm)

Introduction
Related Work
Offline Stabilization Phases for Efficient Q-Learning (SPEQ)
Experimental Results
Conclusions
Ablation Study

Figures (7)

Figure 1: Comparison of state-of-the-art high UTD ratio RL approaches and SAC. This plot shows the performance averaged over four MuJoCo environments as a function of the total number of gradient steps (averaged over 5 random seeds). While high-UTD methods achieve strong final performance, they require significantly more gradient updates (and training time) compared to SAC. In contrast, SAC converges rapidly with far fewer updates, but its final performance remains limited.
Figure 2: Overview of SPEQ. (a) Classical online RL training with high UTD ratios. For each environment interaction, the agent is trained UTD times on the replay buffer. (b) Our approach (SPEQ) which separates the training of the agent into two distinct phases. In the online interaction phase (b.1), we update the agent only once before moving to the next environment step (equivalent to $\text{UTD}=1$). Every $F$ environment steps we switch to an offline stabilization phase (b.2) in which we fine-tune the agent Q-functions for $N$ optimization steps on the current replay buffer.
Figure 3: (a) Results of varying the number of gradient updates $N$ during offline stabilization on the MuJoCo Humanoid task, averaged over 5 random seeds. Offline stabilization phases are performed every $F=10,000$ environment steps. The plot shows that the performance improves by increasing the number of updates up to about 75K iterations, beyond which further updates result in diminishing returns. (b) Comparison of SPEQ to DroQ with varying UTD ratios. Increasing the UTD ratio in DroQ generally leads to improved performance. However, despite performing approximately the same number of gradient updates as SPEQ, DroQ with a UTD ratio of 9 results in significantly lower performance. These results indicate that reducing the UTD ratio alone significantly impacts DroQ's performance, whereas SPEQ offers a performant and computationally efficient solution.
Figure 4: Comparison of SPEQ against baseline and high-UTD methods. Each algorithm performs the same number of gradient updates as SPEQ to evaluate performance per gradient step, that is, how effective each gradient step is at increasing performance in a resource-constrained scenario. We observe that high-UTD methods fail to achieve competitive performance when constrained to a limited number of updates. While SAC performs better than the high-UTD approaches, it requires more environment interactions. SPEQ, in contrast, represents the best trade-off.
Figure 5: Comparison of SPEQ with DroQ at varying UTD ratios. We see that increasing the UTD ratio in DroQ generally leads to improved performance. However, despite performing approximately the same number of gradient updates as SPEQ, DroQ with $\text{UTD}=9$ results in significantly lower performance. This indicates that reducing the UTD ratio alone significantly impacts DroQ's performance, whereas SPEQ offers a more performant and computationally efficient solution.
...and 2 more figures
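The captions of Figures 3 and 5 state that DroQ with $\text{UTD}=9$ performs approximately the same number of gradient updates as SPEQ. A short calculation shows how such an equivalence can arise. The sketch below assumes that SPEQ's total update count is one online update per environment step plus $N$ updates every $F$ steps, and it plugs in $F=10{,}000$ and $N=75{,}000$ from the Figure 3 caption; whether these are the exact settings behind the comparison is an assumption, and the 300,000-step horizon is a hypothetical value chosen only for illustration.

```python
def speq_gradient_updates(env_steps: int, F: int, N: int) -> int:
    """Total updates: one per env step online, plus N per completed offline phase."""
    return env_steps + (env_steps // F) * N


def fixed_utd_gradient_updates(env_steps: int, utd: int) -> int:
    """Total updates for a conventional schedule with a constant UTD ratio."""
    return env_steps * utd


def effective_utd(F: int, N: int) -> float:
    """Average gradient updates per environment step under the SPEQ-style schedule."""
    return 1 + N / F


if __name__ == "__main__":
    env_steps = 300_000  # hypothetical horizon, not taken from the paper
    F, N = 10_000, 75_000  # values quoted in the Figure 3 caption

    print(effective_utd(F, N))                       # average updates per env step
    print(speq_gradient_updates(env_steps, F, N))    # SPEQ-style total
    print(fixed_utd_gradient_updates(env_steps, 9))  # constant UTD = 9 total
```

Under these assumptions the schedule averages $1 + N/F = 8.5$ updates per environment step, which is close to a constant UTD ratio of 9.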
