Moderate Actor-Critic Methods: Controlling Overestimation Bias via Expectile Loss

Ukjo Hwang, Songnam Hong

TL;DR

This work addresses overestimation bias in model-free reinforcement learning by introducing a moderate target that convexly combines an overestimated Q-value with a state-conditioned lower bound learned via an expectile value network. The method hinges on a protester $V_\psi(s)$ trained with expectile loss and a cautious weight $\omega$ to form $y_{mt}$, which is integrated into DDPG, SAC, and distributional RL frameworks to yield MPG, MPG-SD, MAC, and MQC. Empirical results on MuJoCo continuous control tasks show improved performance and stability (lower variance) relative to strong baselines, without added computational burden. The approach is modular, can extend to discrete tasks (Q-learning, DQN) and other policy-based methods (A2C, PPO), and offers a tunable mechanism to trade off optimism and conservatism in value estimates, with potential broad impact on robust MF-RL deployments.

Abstract

Overestimation is a fundamental characteristic of model-free reinforcement learning (MF-RL), arising from the principles of temporal difference learning and the approximation of the Q-function. To address this challenge, we propose a novel moderate target in the Q-function update, formulated as a convex combination of an overestimated Q-function and its lower bound. Our primary contribution lies in the efficient estimation of this lower bound through the lower expectile of the Q-value distribution conditioned on a state. Notably, our moderate target integrates seamlessly into state-of-the-art (SOTA) MF-RL algorithms, including Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC). Experimental results validate the effectiveness of our moderate target in mitigating overestimation bias in DDPG, SAC, and distributional RL algorithms.
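
The two ingredients described above, a lower expectile fitted with an asymmetric squared loss and a target that blends an optimistic Q-value with that lower bound, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the scalar gradient-descent fit, and in particular the exact form of the target ($r + \gamma[(1-\omega)Q + \omega V]$, with larger $\omega$ meaning more caution) are assumptions made for illustration; the paper's definition of $y_{mt}$ may differ in detail.

```python
import numpy as np


def expectile_loss(diff, tau):
    """Mean asymmetric squared loss |tau - 1(diff < 0)| * diff**2."""
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))


def fit_expectile(samples, tau, lr=0.1, steps=2000):
    """Fit a scalar v to the tau-expectile of `samples` by gradient descent.

    Hypothetical helper: it stands in for training a value network
    V_psi(s) on Q-value samples for a single state.
    """
    v = float(np.mean(samples))
    for _ in range(steps):
        diff = samples - v
        weight = np.where(diff < 0, 1.0 - tau, tau)
        grad = -2.0 * np.mean(weight * diff)  # d/dv of the expectile loss
        v -= lr * grad
    return v


def moderate_target(reward, gamma, q_next, v_next, omega, done=0.0):
    """Assumed target: convex blend of optimistic Q and lower bound V."""
    blended = (1.0 - omega) * q_next + omega * v_next
    return reward + gamma * (1.0 - done) * blended


# Q-value samples for one state; tau < 0.5 gives a lower expectile.
q_samples = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
v_low = fit_expectile(q_samples, tau=0.1)
y = moderate_target(reward=1.0, gamma=0.9, q_next=10.0, v_next=v_low, omega=0.5)
```

With `tau=0.5` the loss is symmetric and the fit recovers the sample mean; lowering `tau` pulls the estimate toward the smaller samples, which is what makes it usable as a lower bound. Setting `omega=0` recovers the ordinary optimistic target, and `omega=1` bootstraps from the lower bound alone.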

Paper Structure

This paper contains 24 sections, 1 theorem, 39 equations, 4 figures, 4 tables, 3 algorithms.

Key Result

Theorem 1.1

For any $\gamma \in (0, 1)$ and $\omega \in [0, 1]$, the moderate Bellman operator $\mathcal{T}_m$, defined by the paper's moderate Bellman equation, is a contraction with respect to the $\ell_\infty$-norm. Consequently, $\mathcal{T}_m$ has a unique fixed point $Q$.
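
The contraction claim can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the paper's operator is not reproduced in this summary, so the code uses a stand-in tabular operator that blends the maximum over next actions (the optimistic backup) with the minimum over next actions (a crude substitute for the learned expectile lower bound). The random MDP, the function name `moderate_backup`, and all sizes are hypothetical.

```python
import numpy as np


def moderate_backup(q, rewards, transitions, gamma, omega):
    """Stand-in moderate backup on a tabular Q of shape (states, actions).

    Assumption: max over actions plays the optimistic value and min over
    actions plays the lower bound; the paper uses a learned expectile.
    """
    upper = q.max(axis=1)
    lower = q.min(axis=1)
    blended = (1.0 - omega) * upper + omega * lower
    # transitions has shape (states, actions, next_states).
    return rewards + gamma * transitions @ blended


rng = np.random.default_rng(0)
n_states, n_actions, gamma, omega = 5, 3, 0.9, 0.3
rewards = rng.normal(size=(n_states, n_actions))
transitions = rng.random(size=(n_states, n_actions, n_states))
transitions /= transitions.sum(axis=-1, keepdims=True)  # rows sum to 1

# Largest observed ratio of sup-norm gaps after vs. before one backup.
worst_ratio = 0.0
for _ in range(1000):
    q1 = rng.normal(size=(n_states, n_actions))
    q2 = rng.normal(size=(n_states, n_actions))
    gap_before = np.abs(q1 - q2).max()
    gap_after = np.abs(
        moderate_backup(q1, rewards, transitions, gamma, omega)
        - moderate_backup(q2, rewards, transitions, gamma, omega)
    ).max()
    worst_ratio = max(worst_ratio, gap_after / gap_before)

# Iterating from two different starting points should reach the same Q.
q_a = np.zeros((n_states, n_actions))
q_b = 10.0 * rng.normal(size=(n_states, n_actions))
for _ in range(500):
    q_a = moderate_backup(q_a, rewards, transitions, gamma, omega)
    q_b = moderate_backup(q_b, rewards, transitions, gamma, omega)
```

For this stand-in operator the bound follows because both the max and the min over actions are 1-Lipschitz in the sup norm, so any convex blend of them is as well, and the discount $\gamma$ supplies the contraction factor. The observed `worst_ratio` should therefore not exceed `gamma`, and `q_a` and `q_b` should agree after enough iterations.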

Figures (4)

  • Figure 1: Structure diagrams of algorithms utilizing the proposed protester.
  • Figure 2: Relationships between the proposed and benchmark algorithms, where the shaded boxes represent our algorithms.
  • Figure 3: Learning curves for MuJoCo continuous control tasks. The solid lines denote the average rewards and the shaded areas indicate half the standard deviation of the average evaluations over five episodes. Curves are smoothed with a moving average window for clarity.
  • Figure 4: Log-transformed target Q-value during training.

Theorems & Definitions (2)

  • Theorem 1.1
  • proof