Adaptive $Q$-Network: On-the-fly Target Selection for Deep Reinforcement Learning

Théo Vincent; Fabian Wahren; Jan Peters; Boris Belousov; Carlo D'Eramo

TL;DR

This work tackles the sensitivity of deep RL to hyperparameters by introducing Adaptive $Q$-Network (AdaQN), an online ensemble approach that trains multiple $Q$-functions with different hyperparameters and selects the one with the smallest approximation error to form the shared Bellman target. The method provides a principled way to cope with non-stationarity in RL without extra environment samples, and it can be instantiated as AdaDQN or AdaSAC. The authors prove convergence in the tabular setting and demonstrate substantial gains in sample efficiency, final performance, and robustness across MuJoCo and Atari benchmarks. The results suggest AdaQN can dynamically tailor hyperparameter schedules to the problem, offering practical impact for real-world RL deployments where manual tuning is infeasible.

Abstract

Deep Reinforcement Learning (RL) is well known for being highly sensitive to hyperparameters, requiring substantial effort from practitioners to tune them for the problem at hand. This also limits the applicability of RL in real-world scenarios. In recent years, the field of automated Reinforcement Learning (AutoRL) has grown in popularity by trying to address this issue. However, these approaches typically hinge on additional samples to select well-performing hyperparameters, hindering sample-efficiency and practicality. Furthermore, most AutoRL methods are heavily based on existing AutoML methods, which were originally developed without regard for the additional challenges that RL poses through its non-stationarities. In this work, we propose a new approach for AutoRL, called Adaptive $Q$-Network (AdaQN), that is tailored to RL and accounts for the non-stationarity of the optimization procedure without requiring additional samples. AdaQN learns several $Q$-functions, each trained with different hyperparameters, which are updated online using the $Q$-function with the smallest approximation error as a shared target. Our selection scheme simultaneously handles different hyperparameters, copes with the non-stationarity induced by the RL optimization procedure, and is orthogonal to any critic-based RL algorithm. We demonstrate that AdaQN is theoretically sound and empirically validate it in MuJoCo control problems and Atari $2600$ games, showing benefits in sample-efficiency, overall performance, robustness to stochasticity and training stability.
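
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes tabular $Q$-functions stored as NumPy arrays, a greedy (DQN-style) Bellman target, and a mean squared error measured on the sampled state-action pairs. The function name select_target, its arguments, and the toy data are hypothetical and chosen for illustration only; this is not the authors' implementation, and terminal transitions are not handled.

```python
import numpy as np


def select_target(q_tables, q_target, samples, gamma=0.99):
    """Pick the online Q-table closest to the empirical Bellman target.

    q_tables: array of shape (K, n_states, n_actions), the K online Q-functions.
    q_target: array of shape (n_states, n_actions), the current shared target.
    samples:  list of (s, a, r, s_next) transitions with integer s, a, s_next.
    Assumption: terminal transitions are ignored in this sketch.
    """
    s, a, r, s_next = (np.asarray(x) for x in zip(*samples))
    # Empirical Bellman target: r + gamma * max_a' Q_target(s', a').
    bellman = r + gamma * q_target[s_next].max(axis=1)
    # Mean squared distance of each online Q-function to that target.
    errors = ((q_tables[:, s, a] - bellman) ** 2).mean(axis=1)
    return int(np.argmin(errors)), errors


# Toy usage: two candidate Q-tables on a 2-state, 2-action problem.
q_target = np.zeros((2, 2))
q_tables = np.zeros((2, 2, 2))
q_tables[1, 0, 0] = 1.0  # the second candidate matches the target exactly
samples = [(0, 0, 1.0, 1), (1, 1, 0.0, 0)]
best, errors = select_target(q_tables, q_target, samples)
```

In this toy case the second candidate has zero error on the samples, so it would become the next shared target for all $K$ online networks.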

Paper Structure (24 sections, 1 theorem, 13 equations, 23 figures, 4 tables, 2 algorithms)

Introduction
Preliminaries
Related work
Adaptive temporal-difference target selection
Algorithmic implementation
Experiments
A proof of concept
Continuous control: MuJoCo environments
Vision-based control: Atari $2600$
Infinite hyperparameter spaces
Discussion and conclusion
Theorem statements and proofs
Convergence of AdaQN
Algorithms and hyperparameters
Experimental setup
...and 9 more sections

Key Result

Theorem 4.1

Let $( \theta^k )_{k = 1}^K \in \Theta^{K}$ and $\bar{\theta} \in \Theta$ be vectors of parameters representing $K + 1$ $Q$-functions. Let $\mathcal{D} = \{ (s, a, r, s') \}$ be a set of samples. Let $\nu$ be the distribution represented by the state-action pairs present in $\mathcal{D}$. We note $\m

Figures (23)

Figure 1: Left: Each line represents a training of $Q$-Network (QN) with different hyperparameters. Right: At the $i^{\text{th}}$ target update, Adaptive $Q$-Network (AdaQN) selects the network $Q_i$ (highlighted with a crown) that is the closest to the previous target $\Gamma \bar{Q}_{i-1}$, where $\Gamma$ is the Bellman operator.
Figure 2: On-the-fly architecture selection on Lunar Lander. All architectures contain two hidden layers. The number of neurons in each layer is indicated in the legend. Left: AdaDQN yields a better AUC than every DQN run. Right: Ablation on the behavioral policy and on the strategy to select the target network used to compute the target. Each version of AdaDQN uses the $4$ presented architectures. The strategy presented in Equation (\ref{E:theta_i}) outperforms the other considered strategies.
Figure 3: Distribution of the selected hyperparameters for the target network (top) and for the behavioral policy (bottom) across all seeds. Left: AdaDQN mainly selects the hyperparameter that performs best when evaluated individually. Middle: AdaDQN with $\epsilon_b = 0$ also focuses on the best individual architecture but does not use all available networks for sampling actions, which lowers its performance. Right: AdaDQN-max is a version of AdaDQN where the minimum operator is replaced by the maximum operator for selecting the following target network.
Figure 4: On-the-fly hyperparameter selection on MuJoCo. The $16$ sets of hyperparameters are the elements of the Cartesian product between the learning rates $\{0.0005, 0.001\}$, the optimizers $\{$Adam, RMSProp$\}$, the critic's architectures $\{[256, 256], [512, 512]\}$ and the activation functions $\{$ReLU, Sigmoid$\}$. Left: AdaSAC is more sample-efficient than random search and grid search. Right: AdaSAC yields a better AUC than every SAC run while having a greater final score than $13$ out of $16$ SAC runs. The shading of the dashed lines indicates their ranking for the AUC metric.
Figure 5: Per-environment IQM return when AdaSAC and RandSAC select from the $16$ sets of hyperparameters described in Figure 4. Below each performance plot, a bar plot presents the distribution of hyperparameters selected for the target network across all seeds. AdaSAC outperforms RandSAC and most individual runs by designing non-trivial hyperparameter schedules.
...and 18 more figures
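
The $16$ hyperparameter sets in the Figure 4 caption come from a Cartesian product of four two-element lists. The sketch below enumerates them with Python's standard library. The dictionary keys and variable names are hypothetical; only the values are taken from the caption.

```python
from itertools import product

# Values as listed in the Figure 4 caption; key names are illustrative.
learning_rates = [0.0005, 0.001]
optimizers = ["Adam", "RMSProp"]
critic_architectures = [(256, 256), (512, 512)]
activations = ["ReLU", "Sigmoid"]

configs = [
    {
        "learning_rate": lr,
        "optimizer": opt,
        "critic_architecture": arch,
        "activation": act,
    }
    for lr, opt, arch, act in product(
        learning_rates, optimizers, critic_architectures, activations
    )
]

print(len(configs))  # 2 * 2 * 2 * 2 = 16
```

Each entry of configs would correspond to one critic in the AdaSAC ensemble, or to one independent SAC run in the baseline comparison.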

Theorems & Definitions (2)

Theorem 4.1
proof
