SQT -- std $Q$-target

Nitsan Soffair, Dotan Di-Castro, Orly Avner, Shie Mannor

TL;DR

This work tackles overestimation bias in $Q$-learning by introducing Std $Q$-target (SQT), an uncertainty penalty derived from the disagreement among an ensemble of $Q$-networks. The penalty, scaled by a coefficient $\alpha$, is subtracted from the standard $Q$-target used by TD3/TD7, yielding a new target: $y_t = r(s_t, a_t) + \gamma \mathcal{Q}[Q](s_{t+1}, \mu(s_{t+1}) | \theta^Q) - \alpha \cdot \textit{SQT}[\mathcal{B}]$, where $\textit{SQT}[\mathcal{B}] = \mathrm{mean}_{s \in \mathcal{B}} [ \mathrm{std}_{i=1,\dots,N} Q_i(s, a) ]$, that is, the standard deviation across the $N$ ensemble members, averaged over the batch $\mathcal{B}$. Empirically, the method shows clear performance advantages over DDPG, TD3, and TD7 across seven MuJoCo/Bullet locomotion tasks, with improved stability and robustness attributed to the conservative updates. The paper situates SQT among related conservative and pessimistic $Q$-learning approaches and demonstrates that a minimal change to the $Q$-target can yield substantial gains in both sample efficiency and final performance, highlighting a practical, scalable path to mitigating overestimation bias in ensemble RL.
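The target above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names `sqt_penalty` and `sqt_target` are hypothetical, the operator $\mathcal{Q}[Q]$ is assumed to be the element-wise minimum over the ensemble (as in TD3's clipped target), the standard deviation is the population form, and the default values of `gamma` and `alpha` are placeholders.

```python
import numpy as np


def sqt_penalty(q_batch: np.ndarray) -> float:
    """Mean over the batch of the std across ensemble members.

    q_batch has shape (N, B): N Q-networks evaluated on B batch entries.
    Assumption: population standard deviation (ddof=0).
    """
    per_sample_std = q_batch.std(axis=0)  # shape (B,): disagreement per entry
    return float(per_sample_std.mean())   # scalar SQT[B]


def sqt_target(rewards: np.ndarray,
               next_q: np.ndarray,
               q_batch: np.ndarray,
               gamma: float = 0.99,
               alpha: float = 0.1) -> np.ndarray:
    """y = r + gamma * Q[Q](s', mu(s')) - alpha * SQT[B].

    rewards: shape (B,)
    next_q:  shape (N, B), target Q-values at (s', mu(s'))
    q_batch: shape (N, B), Q-values used for the disagreement penalty
    Assumption: Q[Q] is the minimum over the N ensemble members.
    """
    bootstrap = next_q.min(axis=0)  # assumed form of the Q[Q] operator
    return rewards + gamma * bootstrap - alpha * sqt_penalty(q_batch)


if __name__ == "__main__":
    q_batch = np.array([[1.0, 2.0], [3.0, 6.0]])      # 2 critics, 2 samples
    next_q = np.array([[10.0, 20.0], [12.0, 18.0]])
    rewards = np.array([1.0, 0.0])
    print(sqt_penalty(q_batch))
    print(sqt_target(rewards, next_q, q_batch, gamma=0.5, alpha=2.0))
```

Because the penalty is a single scalar per batch, it shifts every target in the batch down by the same amount; a larger `alpha` gives more conservative targets when the ensemble disagrees.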

Abstract

Std $Q$-target (SQT) is a conservative, ensemble-based, actor-critic $Q$-learning algorithm built on a single key formula: the standard deviation of the $Q$-networks, which acts as an "uncertainty penalty" and serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms DDPG, TD3, and TD7 on seven popular MuJoCo and Bullet tasks. Our results demonstrate the superiority of SQT's $Q$-target formula over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL, with SQT showing a clear performance advantage, by a wide margin, over DDPG, TD3, and TD7 on all tasks.

Paper Structure

This paper contains 9 sections, 14 equations, 5 figures, 2 tables, and 3 algorithms.

Figures (5)

  • Figure 1: SQT's architecture.
  • Figure 2: SQT when applied on top of TD7 vs. DDPG, TD3 and TD7 on ant bullet.
  • Figure 3: SQT when applied on top of TD7 vs. DDPG, TD3 and TD7 on cheetah bullet.
  • Figure 4: SQT when applied on top of TD7 vs. DDPG, TD3 and TD7 on swimmer.
  • Figure 5: SQT when applied on top of TD7 vs. DDPG, TD3 and TD7 on hopper bullet.