Moderate Actor-Critic Methods: Controlling Overestimation Bias via Expectile Loss
Ukjo Hwang, Songnam Hong
TL;DR
This work addresses overestimation bias in model-free reinforcement learning by introducing a moderate target that convexly combines an overestimated Q-value with a state-conditioned lower bound learned via an expectile value network. The method hinges on a protester $V_\psi(s)$ trained with expectile loss and a cautious weight $\omega$ to form $y_{mt}$, which is integrated into DDPG, SAC, and distributional RL frameworks to yield MPG, MPG-SD, MAC, and MQC. Empirical results on MuJoCo continuous control tasks show improved performance and stability (lower variance) relative to strong baselines, without added computational burden. The approach is modular, can extend to discrete tasks (Q-learning, DQN) and other policy-based methods (A2C, PPO), and offers a tunable mechanism to trade off optimism and conservatism in value estimates, with potential broad impact on robust MF-RL deployments.
Abstract
Overestimation is a fundamental characteristic of model-free reinforcement learning (MF-RL), arising from the principles of temporal difference learning and the approximation of the Q-function. To address this challenge, we propose a novel moderate target in the Q-function update, formulated as a convex combination of an overestimated Q-function and its lower bound. Our primary contribution lies in the efficient estimation of this lower bound through the lower expectile of the Q-value distribution conditioned on a state. Notably, our moderate target integrates seamlessly into state-of-the-art (SOTA) MF-RL algorithms, including Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC). Experimental results validate the effectiveness of our moderate target in mitigating overestimation bias in DDPG, SAC, and distributional RL algorithms.
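To make the two ingredients concrete, the following is a minimal sketch of an expectile loss and a moderate target built as a convex mix of a Q-estimate and a lower-bound value estimate. It is not the authors' implementation: the function names, the default values of `tau`, `omega`, and `gamma`, and the choice to apply the mix to the bootstrapped next-state term are all assumptions made for illustration.

```python
import numpy as np


def expectile_loss(q_values, v_values, tau=0.3):
    """Asymmetric squared loss between Q-samples and a value estimate V.

    With tau < 0.5, errors where V exceeds Q are weighted more heavily
    (1 - tau) than errors where V is below Q (tau), so minimizing this
    loss pulls V toward a lower expectile of the Q-values.
    The default tau is an illustrative assumption.
    """
    diff = np.asarray(q_values, dtype=float) - np.asarray(v_values, dtype=float)
    weight = np.where(diff < 0.0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))


def moderate_target(reward, done, next_q, next_v, omega=0.5, gamma=0.99):
    """Hypothetical moderate target: convex mix of Q and its lower bound.

    omega = 0 recovers the usual (possibly overestimated) TD target;
    omega = 1 bootstraps only from the lower-bound estimate next_v.
    The exact form used in the paper may differ from this assumption.
    """
    next_q = np.asarray(next_q, dtype=float)
    next_v = np.asarray(next_v, dtype=float)
    mixed = (1.0 - omega) * next_q + omega * next_v
    return reward + gamma * (1.0 - done) * mixed
```

In this sketch, raising `omega` moves the target toward the conservative estimate, while lowering `tau` makes that estimate itself more pessimistic, so the two parameters control separate aspects of the optimism versus conservatism trade-off.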
