Enhancing Q-Value Updates in Deep Q-Learning via Successor-State Prediction
Lipeng Zu, Hansong Zhou, Xiaonan Zhang
TL;DR
The paper tackles instability in DQN updates caused by relying on off-policy next states from replay buffers. It introduces SADQ, a framework that learns a stochastic successor-state predictor to augment Q-value updates, action selection, and (for images) distributional targets, while proving unbiasedness and reduced variance. The approach yields consistent improvements in stability and sample efficiency across vector-based control tasks, Atari games, and real-world scenarios like CityFlow and O-Cloud. By explicitly modeling one-step successor dynamics, SADQ provides richer future-state guidance that better aligns learning with the current policy, enabling more robust value propagation and faster convergence. This work offers a practical, theoretically grounded direction for integrating lightweight model-based signals into value-based deep RL.
Abstract
Deep Q-Networks (DQNs) estimate future returns by learning from transitions sampled from a replay buffer. However, the target updates in DQN often rely on next states generated by actions from past, potentially suboptimal policies. As a result, these states may not provide informative learning signals, which introduces high variance into the update process. This issue is exacerbated when the sampled transitions are poorly aligned with the agent's current policy. To address this limitation, we propose the Successor-state Aggregation Deep Q-Network (SADQ), which explicitly models environment dynamics using a stochastic transition model. SADQ integrates successor-state distributions into the Q-value estimation process, enabling more stable and policy-aligned value updates. Additionally, it uses the modeled transition structure to explore a more efficient action selection strategy. We provide theoretical guarantees that SADQ maintains unbiased value estimates while reducing training variance. Our extensive empirical results across standard RL benchmarks and real-world vector-based control tasks demonstrate that SADQ consistently outperforms DQN variants in both stability and learning efficiency.
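To make the idea of aggregating over successor states concrete, the following is a minimal sketch, not the authors' implementation. It contrasts the standard one-step DQN target, which bootstraps from the single next state stored in the replay buffer, with a target that averages the bootstrap term over several successor states drawn from a stochastic predictor. The names `dqn_target`, `aggregated_target`, `sample_successors`, the linear toy Q-function, and the Gaussian predictor are all assumptions made for illustration; the abstract does not specify how SADQ combines the predicted distribution with the Q-values.

```python
import numpy as np

# Toy linear Q-function: 3 actions, 2-dimensional state (illustrative assumption).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])

def q_fn(state):
    """Return one Q-value per action for a single state vector."""
    return W @ state

def dqn_target(reward, next_state, q_fn, gamma=0.99, done=False):
    """Standard one-step target: bootstrap from the single stored next state."""
    if done:
        return reward
    return reward + gamma * np.max(q_fn(next_state))

def aggregated_target(reward, successor_samples, q_fn, gamma=0.99, done=False):
    """Hypothetical aggregated target: average max-Q over sampled successors."""
    if done:
        return reward
    max_q = np.array([np.max(q_fn(s)) for s in successor_samples])
    return reward + gamma * max_q.mean()

def sample_successors(state, action, k, rng):
    """Stand-in for a learned stochastic predictor (assumed Gaussian here)."""
    mean = state + 0.1 * (action + 1)
    return mean + 0.05 * rng.standard_normal((k, state.shape[0]))

rng = np.random.default_rng(0)
state = np.array([0.2, -0.1])
samples = sample_successors(state, action=1, k=8, rng=rng)

y_single = dqn_target(1.0, samples[0], q_fn)   # uses one successor only
y_agg = aggregated_target(1.0, samples, q_fn)  # averages over eight successors
print(y_single, y_agg)
```

Under the sketch's assumptions (samples drawn independently from the same successor distribution), the averaged bootstrap term has the same expectation as the single-sample term and lower variance, which is the intuition behind the unbiasedness and variance-reduction claims in the abstract.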
