SQT -- std $Q$-target

Nitsan Soffair; Dotan Di-Castro; Orly Avner; Shie Mannor

TL;DR

This work tackles overestimation bias in $Q$-learning by introducing Std $Q$-target (SQT), an uncertainty penalty derived from the disagreement among an ensemble of $Q$-networks. The penalty, scaled by a coefficient $\alpha$, is subtracted from the standard $Q$-target used by TD3/TD7, yielding a new target: $y_t = r(s_t, a_t) + \gamma \mathcal{Q}[Q](s_{t+1}, \mu(s_{t+1}) | \theta^Q) - \alpha \cdot \textit{SQT}[\mathcal{B}]$, where $\textit{SQT}[\mathcal{B}] = \mathrm{mean}_{s \in \mathcal{B}} [ \mathrm{std}_{i=1,\dots,N} Q_i(s, a) ]$ is the standard deviation across the $N$ ensemble members, averaged over the batch $\mathcal{B}$. Empirically, the method shows clear performance advantages over DDPG, TD3, and TD7 across seven MuJoCo/Bullet locomotion tasks, with improved stability and robustness due to the conservative updates. The paper situates SQT among related conservative and pessimistic $Q$-learning approaches and demonstrates that a minimal change to the $Q$-target can yield substantial gains in both sample efficiency and final performance, highlighting a practical, scalable path to mitigating overestimation bias in ensemble RL.
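The target formula above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the nested-list layout of `q_values`, the use of the population standard deviation, and the default values of `gamma` and `alpha` are all assumptions made for illustration. The bootstrapped value $\mathcal{Q}[Q](s_{t+1}, \mu(s_{t+1}))$ is taken as an input (`next_q`), since it would come from the underlying TD3/TD7 code, and terminal-state handling is omitted.

```python
from statistics import mean, pstdev


def sqt_penalty(q_values):
    """Batch-mean of the per-sample std across ensemble members.

    q_values[i][j] is the estimate of Q-network i for batch item j
    (hypothetical layout chosen for this sketch).
    """
    n_items = len(q_values[0])
    # Std over the ensemble (index i) for each batch item j.
    # Population std is an assumption; the source does not say which variant is used.
    per_item_std = [pstdev([q_i[j] for q_i in q_values]) for j in range(n_items)]
    # Average the disagreement over the batch to get one scalar penalty.
    return mean(per_item_std)


def sqt_targets(rewards, next_q, q_values, gamma=0.99, alpha=0.1):
    """y = r + gamma * next_q - alpha * SQT[B], computed per batch item.

    next_q[j] stands in for the base algorithm's target value at the next
    state; gamma and alpha defaults are placeholders, not values from the paper.
    """
    penalty = sqt_penalty(q_values)
    return [r + gamma * q_next - alpha * penalty
            for r, q_next in zip(rewards, next_q)]


if __name__ == "__main__":
    # Two ensemble members, two batch items: they disagree on item 0 only.
    q_values = [[1.0, 2.0], [3.0, 2.0]]
    print(sqt_penalty(q_values))
    print(sqt_targets([1.0, 0.0], [10.0, 20.0], q_values, gamma=0.5, alpha=2.0))
```

Because the penalty is a single scalar for the whole batch, every target in the batch is shifted down by the same amount; larger ensemble disagreement gives a more conservative target.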

Abstract

Std $Q$-target (SQT) is a conservative, ensemble-based, actor-critic $Q$-learning algorithm built on a single key $Q$-formula: the standard deviation of the $Q$-networks, which acts as an "uncertainty penalty" and serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms DDPG, TD3, and TD7 on seven popular MuJoCo and Bullet tasks. Our results demonstrate the superiority of SQT's $Q$-target formula over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL, and SQT shows a clear performance advantage, by a wide margin, over DDPG, TD3, and TD7 on all tasks.

Paper Structure (9 sections, 14 equations, 5 figures, 2 tables, 3 algorithms)

Introduction
Background
Overestimation bias
Underestimation bias
Std $Q$-target
Algorithm
Experiments
Related work
Conclusion

Figures (5)

Figure 1: SQT's architecture.
Figure 2: SQT when applied on top of TD7 vs. DDPG, TD3 and TD7 on ant bullet.
Figure 3: SQT when applied on top of TD7 vs. DDPG, TD3 and TD7 on cheetah bullet.
Figure 4: SQT when applied on top of TD7 vs. DDPG, TD3 and TD7 on swimmer.
Figure 5: SQT when applied on top of TD7 vs. DDPG, TD3 and TD7 on hopper-bullet.
