Enhancing Q-Value Updates in Deep Q-Learning via Successor-State Prediction

Lipeng Zu, Hansong Zhou, Xiaonan Zhang

TL;DR

The paper tackles instability in DQN updates caused by relying on off-policy next states from replay buffers. It introduces SADQ, a framework that learns a stochastic successor-state predictor to augment Q-value updates, action selection, and (for images) distributional targets, while proving unbiasedness and reduced variance. The approach yields consistent improvements in stability and sample efficiency across vector-based control tasks, Atari games, and real-world scenarios like CityFlow and O-Cloud. By explicitly modeling one-step successor dynamics, SADQ provides richer future-state guidance that better aligns learning with the current policy, enabling more robust value propagation and faster convergence. This work offers a practical, theoretically grounded direction for integrating lightweight model-based signals into value-based deep RL.

Abstract

Deep Q-Networks (DQNs) estimate future returns by learning from transitions sampled from a replay buffer. However, the target updates in DQN often rely on next states generated by actions from a past, potentially suboptimal policy. As a result, these states may not provide informative learning signals, introducing high variance into the update process. This issue is exacerbated when the sampled transitions are poorly aligned with the agent's current policy. To address this limitation, we propose the Successor-state Aggregation Deep Q-Network (SADQ), which explicitly models environment dynamics using a stochastic transition model. SADQ integrates successor-state distributions into the Q-value estimation process, enabling more stable and policy-aligned value updates. Additionally, it explores a more efficient action selection strategy with the modeled transition structure. We provide theoretical guarantees that SADQ maintains unbiased value estimates while reducing training variance. Our extensive empirical results across standard RL benchmarks and real-world vector-based control tasks demonstrate that SADQ consistently outperforms DQN variants in both stability and learning efficiency.

Paper Structure

This paper contains 19 sections, 4 theorems, 66 equations, 12 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Incorporating $V'(\hat{s}'_{\mathcal{M}})$ into the update process does not introduce additional bias between the estimated Q-value and the optimal Q-value.
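
To make the role of $V'(\hat{s}'_{\mathcal{M}})$ more concrete, the sketch below contrasts the standard one-step DQN target with a target that also uses the value of a model-predicted successor state. It is only an illustration under stated assumptions, not the paper's update rule: the mixing weight `beta`, the choice of $V'$ as the maximum Q-value at the predicted state, and all function names are hypothetical, since this summary does not give the exact SADQ formula.

```python
import numpy as np


def dqn_target(reward, next_q, done, gamma=0.99):
    """Standard one-step DQN target: r + gamma * max_a' Q(s', a').

    next_q holds Q(s', a') for the next state stored in the replay buffer.
    """
    return reward + gamma * (1.0 - done) * next_q.max(axis=-1)


def successor_augmented_target(reward, next_q, pred_next_q, done,
                               gamma=0.99, beta=0.5):
    """Hypothetical target that blends in a predicted successor-state value.

    pred_next_q holds Q(s_hat', a') evaluated at a successor state s_hat'
    produced by a learned transition model (assumed to exist elsewhere).
    beta is an assumed mixing weight; beta = 0 recovers the DQN target.
    """
    v_buffer = next_q.max(axis=-1)      # value of the replayed next state
    v_model = pred_next_q.max(axis=-1)  # assumed form of V'(s_hat'_M)
    v_mix = (1.0 - beta) * v_buffer + beta * v_model
    return reward + gamma * (1.0 - done) * v_mix


# Toy batch of two transitions with two actions each.
reward = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])  # second transition is terminal
next_q = np.array([[1.0, 2.0], [0.5, 0.0]])
pred_next_q = np.array([[3.0, 4.0], [1.0, 1.5]])

plain = dqn_target(reward, next_q, done, gamma=0.9)
mixed = successor_augmented_target(reward, next_q, pred_next_q, done,
                                   gamma=0.9, beta=0.5)
```

In this toy batch the terminal transition gets the same target under both functions, because the bootstrap term is masked by `done`; only the non-terminal transition is affected by the predicted successor value.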

Figures (12)

  • Figure 1: Upper: Performance comparison of basic DQN and Rainbow across Acrobot and Cartpole environments. Lower: The evaluation of Q discrepancy ($\max Q(s,a) - \min Q(s,a)$) of DQN.
  • Figure 2: Performance comparison of SADQ with other baselines across conventional RL tasks. The legend above describes the corresponding methods in (a) Acrobot, (b) BitFlip and (c) LunarLander.
  • Figure 3: Effect of stochastic model convergence on performance. (a) Effect of the update frequency on SADQ, compared with C51 and DQN; (b) Loss of the stochastic model under different update configurations.
  • Figure 4: Performance comparison of SADQ with other baselines across CityFlow and O-Cloud scenarios. (a) Performance in CityFlow scenario; (b) Training loss in CityFlow; (c) Performance in O-Cloud scenario; (d) Training loss in O-Cloud.
  • Figure 5: Ablation study of SADQ in Acrobot and LunarLander.
  • ...and 7 more figures
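
The Q discrepancy named in the Figure 1 caption, $\max Q(s,a) - \min Q(s,a)$, can be computed per state from a batch of Q-values. The snippet below is a minimal sketch, assuming the Q-values are available as an array of shape `(batch, num_actions)`; the function name `q_discrepancy` is illustrative and does not come from the paper.

```python
import numpy as np


def q_discrepancy(q_values):
    """Per-state spread of Q-values: max_a Q(s, a) - min_a Q(s, a).

    q_values: array-like of shape (batch, num_actions).
    Returns an array of shape (batch,).
    """
    q_values = np.asarray(q_values, dtype=float)
    return q_values.max(axis=-1) - q_values.min(axis=-1)


# Two states with three actions each; the second state has identical
# Q-values for all actions, so its discrepancy is zero.
example = q_discrepancy([[1.0, 3.0, 2.0], [0.5, 0.5, 0.5]])
```

A discrepancy near zero means the network barely distinguishes between actions in that state, while a large value indicates a strong preference for one action.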

Theorems & Definitions (7)

  • Theorem 1
  • proof
  • Lemma 1
  • Lemma 2
  • Theorem 2
  • proof
  • proof