Iterated $Q$-Network: Beyond One-Step Bellman Updates in Deep Reinforcement Learning

Théo Vincent; Daniel Palenicek; Boris Belousov; Jan Peters; Carlo D'Eramo

TL;DR

The paper tackles sample efficiency and computation bottlenecks in value-based and actor-critic RL caused by one-step Bellman updates. It introduces Iterated $Q$-Network (i-QN), a telescoped framework that learns $K$ consecutive Bellman updates in parallel using a chain of $Q$-functions connected by targets, improving information reuse and sample efficiency. The authors provide theoretical justification and a sufficient condition for the descent of the cumulative approximation error, and demonstrate practical instantiations (i-DQN, i-SAC) with strong empirical gains on Atari 2600 and MuJoCo tasks, while highlighting tradeoffs in training time and memory. The work shows that parallelizing Bellman updates can significantly boost performance, motivating extensions to other sequential RL methods and offering a new tool for scalable, high-performance reinforcement learning.

Abstract

The vast majority of reinforcement learning methods are heavily affected by the computational effort and data required to obtain effective estimates of action-value functions, which in turn determine the quality of the overall performance and the sample efficiency of the learning procedure. Typically, action-value functions are estimated through an iterative scheme that alternates between applying an empirical approximation of the Bellman operator and projecting the result onto a chosen function space. It has been observed that this scheme can be generalized to carry out multiple iterations of the Bellman operator at once, benefiting the underlying learning algorithm. However, until now, it has been challenging to implement this idea effectively, especially in high-dimensional problems. In this paper, we introduce iterated $Q$-Network (i-QN), a novel principled approach that enables multiple consecutive Bellman updates by learning a tailored sequence of action-value functions where each serves as the target for the next. We show that i-QN is theoretically grounded and that it can be seamlessly used in value-based and actor-critic methods. We empirically demonstrate the advantages of i-QN in Atari 2600 games and MuJoCo continuous control problems.
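To make the idea of a sequence of action-value functions where each serves as the target for the next more concrete, here is a minimal tabular sketch. It is not the authors' implementation: the names `bellman_target` and `iterated_td_loss`, the list-of-lists $Q$-tables, the greedy max target, and the toy batch are all assumptions made for illustration, and the separate target networks used in practice are left out.

```python
from typing import List, Tuple

QTable = List[List[float]]             # q[state][action], hypothetical tabular stand-in for a network
Sample = Tuple[int, int, float, int]   # (state, action, reward, next_state)


def bellman_target(q_prev: QTable, reward: float, next_state: int, gamma: float) -> float:
    """Empirical Bellman target r + gamma * max_a' Q_{k-1}(s', a') for one sample."""
    return reward + gamma * max(q_prev[next_state])


def iterated_td_loss(q_chain: List[QTable], batch: List[Sample], gamma: float) -> float:
    """Sum of K squared TD errors, averaged over the batch.

    q_chain holds Q_0, ..., Q_K. For k = 1..K, Q_k regresses onto the
    Bellman target built from Q_{k-1}, which is treated as fixed.
    """
    total = 0.0
    for k in range(1, len(q_chain)):
        for state, action, reward, next_state in batch:
            target = bellman_target(q_chain[k - 1], reward, next_state, gamma)
            total += (target - q_chain[k][state][action]) ** 2
    return total / len(batch)


# Toy chain with K = 2 on a 2-state, 2-action problem (made-up numbers).
q0 = [[0.0, 0.0], [0.0, 0.0]]
q1 = [[1.0, 0.0], [0.0, 0.0]]   # matches the target built from q0 on the sample below
q2 = [[1.0, 0.0], [0.0, 0.0]]   # lags behind the target built from q1
batch = [(0, 0, 1.0, 0)]

loss = iterated_td_loss([q0, q1, q2], batch, gamma=0.5)
print(loss)
```

With a chain of length two (`[q0, q1]`) the same function reduces to a single temporal difference error, which corresponds to the regular one-step setting the paper contrasts against.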

Paper Structure (28 sections, 2 theorems, 11 equations, 22 figures, 5 tables, 2 algorithms)

Introduction
Preliminaries
Related work
Learning multiple Bellman updates
Practical implementation
Motivating example
Experiments
Atari 2600
Atari results.
Ablation studies
MuJoCo continuous control
Discussion and conclusion
Proofs
Pseudocodes
Experiments details
...and 13 more sections

Key Result

Proposition 4.1

Let $t \in \mathbb{N}$, $(\theta_k^t)_{k = 0}^K$ be a sequence of parameters of $\Theta$, and $\nu$ be a probability distribution over state-action pairs. If, for every $k \in \{1, \dots, K\}$, then, we have

Figures (22)

Figure 1: Iterated $Q$-Network (ours) uses the online network of regular $Q$-Network approaches to build a target for a second online network, and so on, through the application of the Bellman operator $\Gamma$. The resulting loss $\mathcal{L}_{\text{i-QN}}$ comprises $K$ temporal difference errors instead of just one as in $\mathcal{L}_{\text{QN}}$.
Figure 2: Graphical representation of the regular $Q$-Network approach (left) compared to our proposed iterated $Q$-Network approach (right) in the space of $Q$-functions $\mathcal{Q}$. The regular $Q$-Network approach proceeds sequentially, i.e., $Q_2$ is learned only when the learning process of $Q_1$ is finished. With iterated $Q$-Network, all parameters are learned simultaneously. The projection of $Q^{\star}$ and projections of the Bellman update are depicted with a dashed line. The losses are shown in red.
Figure 3: Empirical Bellman operators other than the classical $\widehat{\Gamma}$ can be represented with a different notation, $\tilde{\Gamma}$. Changing the class of function approximators $\mathcal{Q}_{\Theta}$ results in a new space $\tilde{\mathcal{Q}}_{\Theta}$.
Figure 4: Left: Graphical representation of i-QN, where each online network $Q_k$ learns from its respective target network $\bar{Q}_{k-1}$. Every $D$ steps, each target network $\bar{Q}_k$ ($k > 0$) is updated to its respective $Q_k$. Right: i-QN considers a window of $K$ Bellman updates, as opposed to QN methods, which consider only $1$ Bellman update. Every $T$ steps, the windows are shifted forward to consider the following Bellman updates.
Figure 5: Distance between the optimal value function $V^*$ and $V^{\pi_k}$ at each Bellman iteration $k$ for different $K$.
...and 17 more figures
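The two periods in the Figure 4 caption ($D$ for target synchronization, $T$ for shifting the window) can be sketched as a simple step counter. This is only one plausible reading of the caption: the function names `due_events` and `window` are hypothetical, and the assumption that each shift moves the window forward by exactly one Bellman update is not stated in the caption.

```python
from typing import List, Tuple


def due_events(step: int, D: int, T: int) -> List[str]:
    """Return which periodic events fire at a given gradient step (step >= 1)."""
    events = []
    if step % D == 0:
        # Each target network (k > 0) is updated to its online network.
        events.append("sync_targets")
    if step % T == 0:
        # The window of K Bellman updates moves forward.
        events.append("shift_window")
    return events


def window(step: int, K: int, T: int) -> Tuple[int, int]:
    """Indices (first, last) of the Bellman updates covered after `step` steps.

    Assumption: the window starts at updates 1..K and advances by one
    Bellman update at every shift.
    """
    shifts = step // T
    first = shifts + 1
    return (first, first + K - 1)


print(due_events(6, D=2, T=3))   # both periods divide 6
print(window(8, K=3, T=4))       # i-QN style window of 3 updates
print(window(8, K=1, T=4))       # regular QN: a single update
```

Under these assumptions, a regular $Q$-Network method is the special case $K = 1$, where the window always contains a single Bellman update.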

Theorems & Definitions (4)

Proposition 4.1
proof
Proposition A.1
proof
