The Role of Target Update Frequencies in Q-Learning
Simon Weissmann, Tilman Aach, Benedikt Wille, Sebastian Kassing, Leif Döring
TL;DR
This work analyzes target network maintenance in Q-learning through an approximate dynamic programming lens, treating periodic target updates as a nested Bellman-operator approximation with an inner SGD loop. It proves that fixed target update frequency (TUF) schedules are suboptimal and derives a near-optimal increasing schedule with geometric growth $K_n \propto \gamma^{-2n/3}$, which improves the sample complexity from $O\left(\frac{\log((1-\gamma)^{-1}\varepsilon^{-1})}{(1-\gamma)^4\xi^2\varepsilon^2}\right)$ to $O\left(\frac{1}{\xi^2(1-\gamma)^5\varepsilon^2}\right)$, removing the logarithmic factor in $\varepsilon$. The analysis combines a general outer-loop contraction argument with a detailed inner-loop SGD bound under asynchronous sampling, yielding an explicit bias-variance trade-off for the target-update period. The results inform the principled design of target-update schedules in tabular Q-learning and offer a foundation for extensions to deep RL settings, including accuracy-triggered updates and adaptive optimizers. Overall, the paper provides both finite-time convergence guarantees and practical guidance for stabilizing and accelerating Q-learning through adaptive target updates.
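To make the growth rate concrete, here is a minimal sketch that tabulates periods of the form $K_n \propto \gamma^{-2n/3}$. It reads $K_n$ as the number of inner-loop steps before the $n$-th target refresh, which is an interpretation rather than something the summary states. The function name `tuf_schedule`, the proportionality constant `k0`, and the rounding with `ceil` are likewise assumptions; the summary fixes only the growth rate.

```python
import math


def tuf_schedule(gamma: float, num_outer: int, k0: float = 1.0) -> list[int]:
    """Return K_n = ceil(k0 * gamma**(-2n/3)) for n = 0, ..., num_outer - 1.

    k0 is a hypothetical proportionality constant; only the per-iteration
    growth factor gamma**(-2/3) is taken from the text.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    return [math.ceil(k0 * gamma ** (-2.0 * n / 3.0)) for n in range(num_outer)]


periods = tuf_schedule(gamma=0.9, num_outer=30, k0=10.0)
total_samples = sum(periods)  # total number of inner-loop samples consumed
```

Because the periods grow geometrically, the last few outer iterations account for most of `total_samples`.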
Abstract
The target network update frequency (TUF) is a central stabilization mechanism in (deep) Q-learning. However, its selection remains poorly understood and is often treated merely as another tunable hyperparameter rather than as a principled design decision. This work provides a theoretical analysis of target fixing in tabular Q-learning through the lens of approximate dynamic programming. We formulate periodic target updates as a nested optimization scheme in which each outer iteration applies an inexact Bellman optimality operator, approximated by a generic inner loop optimizer. Rigorous theory yields a finite-time convergence analysis for the asynchronous sampling setting, specializing to stochastic gradient descent in the inner loop. Our results deliver an explicit characterization of the bias-variance trade-off induced by the target update period, showing how to optimally set this critical hyperparameter. We prove that constant target update schedules are suboptimal, incurring a logarithmic overhead in sample complexity that is entirely avoidable with adaptive schedules. Our analysis shows that the optimal target update frequency increases geometrically over the course of the learning process.
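The nested scheme described above can be sketched for the tabular case as follows. This is an illustrative reading, not the authors' implementation: the outer loop freezes a copy of the Q-table as the target, and the inner loop runs SGD on the squared error toward the one-sample Bellman target, updating a single state-action pair per step. The function name `q_learning_with_target`, the uniform sampling of state-action pairs, the constant step size, and the toy MDP are all assumptions made for the example.

```python
import math
import random


def q_learning_with_target(P, R, gamma, periods, step_size=0.1, seed=0):
    """Tabular Q-learning with a frozen target table (illustrative sketch).

    P[s][a] lists next-state probabilities, R[s][a] is a deterministic reward,
    and periods[n] is the number of inner SGD steps before the n-th refresh.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for k_n in periods:                        # outer loop: one inexact Bellman step
        target = [row[:] for row in q]         # freeze the current table as target
        for _ in range(k_n):                   # inner loop: SGD toward T(target)
            s = rng.randrange(n_states)        # asynchronous: one (s, a) per step
            a = rng.randrange(n_actions)       # (uniform sampling is an assumption)
            s_next = rng.choices(range(n_states), weights=P[s][a])[0]
            y = R[s][a] + gamma * max(target[s_next])
            q[s][a] -= step_size * (q[s][a] - y)   # gradient of 0.5 * (q - y)**2
    return q


# Hypothetical 2-state, 2-action MDP: action 0 stays, action 1 switches state,
# and only staying in state 1 pays reward 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 0.0]]
gamma = 0.5
periods = [math.ceil(20 * gamma ** (-2 * n / 3)) for n in range(12)]
q = q_learning_with_target(P, R, gamma, periods)
```

For this toy MDP the fixed point of the Bellman optimality operator is $Q^* = [[0.5, 1.0], [2.0, 0.5]]$, so the returned table can be checked against it. Short early periods keep cheap, coarse Bellman steps, while longer later periods let the inner loop resolve each target more accurately.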
