SQT -- std $Q$-target
Nitsan Soffair, Dotan Di-Castro, Orly Avner, Shie Mannor
TL;DR
This work tackles overestimation bias in $Q$-learning by introducing Std $Q$-target (SQT), an uncertainty penalty derived from the disagreement among an ensemble of $Q$-networks. The penalty, scaled by a coefficient $\alpha$, is subtracted from the standard $Q$-target used by TD3/TD7, yielding a new target: $y_t = r(s_t, a_t) + \gamma \mathcal{Q}[Q](s_{t+1}, \mu(s_{t+1}) | \theta^Q) - \alpha \cdot \textit{SQT}[\mathcal{B}]$, where $\textit{SQT}[\mathcal{B}] = \mathrm{mean}_{s \in \mathcal{B}} [ \mathrm{std}_{i=1...N} Q_i(s, a) ]$ is the batch mean of the standard deviation across the $N$ ensemble members. Empirically, the method shows clear performance advantages over DDPG, TD3, and TD7 across seven MuJoCo/Bullet locomotion tasks, with improved stability and robustness attributed to the more conservative updates. The paper situates SQT among related conservative and pessimistic $Q$-learning approaches and demonstrates that a minimal change to the $Q$-target can yield substantial gains in both sample efficiency and final performance, pointing to a practical, scalable way to mitigate overestimation bias in ensemble RL.
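To make the target formula above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `sqt_penalty` and `sqt_target`, the default values of `gamma` and `alpha`, the use of the population standard deviation, and the choice of a TD3-style minimum over critics for the operator $\mathcal{Q}[Q]$ are all assumptions made for this example. Which state-action pairs are fed to the critics to produce `q_batch` is left to the caller.

```python
from statistics import pstdev


def sqt_penalty(q_batch):
    """Batch mean of the standard deviation across critics.

    q_batch[i][j] is critic i's Q-estimate for batch sample j
    (N critics, B samples). Population std is an assumption here.
    """
    # zip(*q_batch) yields, for each sample, the N critic estimates.
    per_sample_std = [pstdev(estimates) for estimates in zip(*q_batch)]
    return sum(per_sample_std) / len(per_sample_std)


def sqt_target(rewards, q_next, q_batch, gamma=0.99, alpha=0.5):
    """Penalized target y = r + gamma * Q_next - alpha * SQT[B].

    q_next[i][j] is target critic i's estimate at (s', mu(s')).
    Assumption: the Q-target operator is a TD3-style minimum over
    critics; gamma and alpha defaults are hypothetical.
    """
    penalty = sqt_penalty(q_batch)  # one scalar shared by the whole batch
    base = [min(estimates) for estimates in zip(*q_next)]
    return [r + gamma * q - alpha * penalty for r, q in zip(rewards, base)]
```

Because the penalty is a single scalar per batch, it lowers every target in the batch by the same amount; the more the critics disagree, the more conservative the update becomes.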
Abstract
Std $Q$-target (SQT) is a conservative, ensemble-based, actor-critic $Q$-learning algorithm built on a single key $Q$-formula: the standard deviation of the $Q$-networks, which acts as an "uncertainty penalty" and serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of the TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms DDPG, TD3, and TD7 on seven popular MuJoCo and Bullet tasks. Our results demonstrate the superiority of SQT's $Q$-target formula over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL, and SQT shows a clear performance advantage, by a wide margin, over DDPG, TD3, and TD7 on all tasks.
