Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

Ahmed Hendawy, Henrik Metternich, Théo Vincent, Mahdi Kallel, Jan Peters, Carlo D'Eramo

TL;DR

MINTO addresses the persistence of moving targets and maximization bias in off-policy reinforcement learning by combining online and target networks through a MINimum operator on their Q-value estimates. The core idea is to compute bootstrapped targets as $y = r + \gamma \max_{a'} \min\big(Q_{\bar{\theta}}(s',a'), Q_{\theta}(s',a')\big)$, so that the online network's estimate is used whenever it is the smaller of the two and the stable target network's estimate is used otherwise. The authors demonstrate that MINTO accelerates learning and improves final performance across online, offline, value-based, and actor–critic methods, with negligible computational overhead and no extra hyperparameters. The approach shows strong empirical gains on Atari (combined with DQN, IQN, and CQL) and on Simba-based continuous-control tasks, and is supported by convergence guarantees in the tabular setting. These results suggest MINTO as a practical, broadly applicable alternative to conventional target-network designs, with potential for adaptive operator strategies and multi-task extensions in future work.
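The target formula above can be illustrated with a minimal sketch for a single transition with a discrete action set. The function names, the plain-list representation of Q-values, and the numbers are assumptions made for illustration; this is not the authors' implementation, and terminal-state handling is omitted.

```python
from typing import Sequence


def minto_target(
    reward: float,
    gamma: float,
    q_target_next: Sequence[float],
    q_online_next: Sequence[float],
) -> float:
    """Compute y = r + gamma * max_a' min(Q_target(s', a'), Q_online(s', a')).

    q_target_next[a] and q_online_next[a] hold the two networks' estimates
    for action a in the next state s' (hypothetical inputs).
    """
    if not q_target_next or len(q_target_next) != len(q_online_next):
        raise ValueError("both estimates must be non-empty and cover the same actions")
    # Element-wise minimum over the two estimates, one value per action.
    pessimistic = [min(t, o) for t, o in zip(q_target_next, q_online_next)]
    # Greedy bootstrap over the pessimistic estimates.
    return reward + gamma * max(pessimistic)


def target_network_only(reward: float, gamma: float, q_target_next: Sequence[float]) -> float:
    """Conventional target for comparison: y = r + gamma * max_a' Q_target(s', a')."""
    return reward + gamma * max(q_target_next)


# Made-up values for one transition.
q_target = [1.0, 2.0, 4.0]
q_online = [3.0, 1.5, 2.5]

y_minto = minto_target(1.0, 0.5, q_target, q_online)
y_target_only = target_network_only(1.0, 0.5, q_target)
print(y_minto, y_target_only)
```

Because the minimum is taken per action before the maximization, the resulting target can never exceed the one obtained from either network alone.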

Abstract

The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well known to lead to unstable learning. In this work, we aim to obtain the best of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple yet effective modification, we show that MINTO enables faster and more stable value function learning by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms at negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.

Paper Structure

This paper contains 28 sections, 1 theorem, 7 equations, 20 figures, 5 tables, 4 algorithms.

Key Result

Corollary 1

Let the MINTO operator be defined as $G^{\text{MINTO}}(Q_{s}) = \max_{a \in \mathcal{A}} \min_{j \in \mathcal{T}} Q_{sa}(j)$, where $\mathcal{T}$ is a set of historical time indices. Under the standard stochastic approximation assumptions for the learning rate (Assumption 2 in Lan et al., 2020), the Q-
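The operator in the corollary can be sketched for a single state as follows. The representation is an assumption made for illustration: `q_snapshots[j][a]` stands for $Q_{sa}(j)$, one list of action values per historical index $j \in \mathcal{T}$, and the numbers are made up.

```python
from typing import Sequence


def minto_operator(q_snapshots: Sequence[Sequence[float]]) -> float:
    """G(Q_s) = max over actions a of min over snapshots j of Q_sa(j).

    q_snapshots[j][a] is the value of action a in state s at historical
    index j (hypothetical layout, not taken from the paper).
    """
    if not q_snapshots or not q_snapshots[0]:
        raise ValueError("need at least one snapshot with at least one action")
    num_actions = len(q_snapshots[0])
    if any(len(snapshot) != num_actions for snapshot in q_snapshots):
        raise ValueError("every snapshot must cover the same actions")
    # Inner min: most pessimistic value of each action across the history.
    per_action_min = [
        min(snapshot[a] for snapshot in q_snapshots) for a in range(num_actions)
    ]
    # Outer max: greedy choice over the pessimistic values.
    return max(per_action_min)


# Three historical snapshots of a two-action state.
history = [[1.0, 5.0], [2.0, 3.0], [0.5, 4.0]]
print(minto_operator(history))  # per-action minima are [0.5, 3.0]
```

With exactly two snapshots, one playing the role of the target estimate and one the online estimate, this reduces to the max-min expression used in the bootstrapped target from the TL;DR.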

Figures (20)

  • Figure 1: Results of benchmarking the Minimum operator utilized by MINTO against other potential operators on $15$ Atari games with the CNN architecture. We report the AUC metric using IQM and the confidence interval computed across $5$ seeds. Methods are trained for $50$ million frames.
  • Figure 2: Results of benchmarking MINTO and DQN on $15$ Atari games with CNN, and IMPALA with LayerNorm (LN) architectures. All results are reported using IQM and confidence interval on $5$ seeds and over all games. Left: We report the AUC metric for MINTO and DQN while utilizing both architectures. Center and Right: We illustrate the performance learning curves of MINTO and DQN after employing both architecture options.
  • Figure 3: Results of benchmarking MINTO against related baselines on $15$ Atari games with the CNN architecture. We report the AUC metric using IQM and the confidence interval computed across $5$ seeds.
  • Figure 4: Results of benchmarking MINTO+IQN and IQN on $15$ Atari games in the online RL setting using the CNN architecture. All metrics are computed over $5$ seeds. Left: We report the AUC metric computed by the IQM and its confidence interval. Center: We demonstrate the performance learning curves of both methods in terms of IQM return and confidence interval. Right: We report the frequency of selecting the online estimate during the course of training on the Breakout game, considering the mean and standard deviation.
  • Figure 5: Results of benchmarking CQL and CQL+MINTO on $14$ Atari games with CNN, and IMPALA with LayerNorm (LN) architectures. All results are reported using IQM and confidence interval on $5$ seeds and over all games. Left: We report the AUC metric for both methods while utilizing the two architectures. Center and Right: We illustrate the performance learning curves of CQL and CQL+MINTO after employing both architecture options.
  • ...and 15 more figures
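The captions above aggregate results with IQM. Assuming this denotes the interquartile mean as commonly used in RL evaluation (the mean of the middle 50% of scores), a minimal sketch looks like the following. The truncation rule for the cut size and the example scores are assumptions for illustration, not the paper's evaluation code or data.

```python
from typing import Sequence


def interquartile_mean(scores: Sequence[float]) -> float:
    """Mean of the scores left after dropping the lowest and highest 25%.

    The number of values dropped on each side is int(0.25 * n), i.e. the
    fractional part is truncated (an assumed convention).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    n = len(ordered)
    cut = int(0.25 * n)
    middle = ordered[cut : n - cut]
    return sum(middle) / len(middle)


# Made-up normalized scores with one outlier run.
scores = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]
print(interquartile_mean(scores))   # middle values are [2.0, 3.0, 4.0, 5.0]
print(sum(scores) / len(scores))    # plain mean, pulled up by the outlier
```

Trimming both tails keeps a single unusually good or bad run from dominating the aggregate.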

Theorems & Definitions (2)

  • Corollary 1: Convergence of MINTO
  • proof