Bridging the Performance Gap Between Target-Free and Target-Based Reinforcement Learning

Théo Vincent, Yogesh Tripathi, Tim Faust, Abdullah Akgül, Yaniv Oren, Melih Kandemir, Jan Peters, Carlo D'Eramo

TL;DR

This work introduces a new method that uses a copy of the last linear layer of the online network as a target network, while sharing the remaining parameters with the up-to-date online network, to bridge the performance gap between target-free and target-based approaches across various problems while using a single $Q$-network.

Abstract

The use of target networks in deep reinforcement learning is a widely popular solution to mitigate the brittleness of semi-gradient approaches and stabilize learning. However, target networks notoriously require additional memory and delay the propagation of Bellman updates compared to an ideal target-free approach. In this work, we step out of the binary choice between target-free and target-based algorithms. We introduce a new method that uses a copy of the last linear layer of the online network as a target network, while sharing the remaining parameters with the up-to-date online network. This simple modification enables us to keep the low memory footprint of target-free approaches while leveraging the target-based literature. We find that combining our approach with the concept of iterated $Q$-learning, which consists of learning consecutive Bellman updates in parallel, helps improve the sample efficiency of target-free approaches. Our proposed method, iterated Shared $Q$-Learning (iS-QL), bridges the performance gap between target-free and target-based approaches across various problems while using a single $Q$-network, thus stepping towards resource-efficient reinforcement learning algorithms.
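The shared-feature idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the one-layer ReLU feature extractor, the dimensions, and the names `W_feat`, `W_head`, `W_head_target`, `q_online`, `q_target`, and `td_target` are all hypothetical and chosen only for illustration. The point it shows is that the only extra parameters are a frozen copy of the last linear layer, while the target values are computed on top of the up-to-date shared features.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim, n_actions, gamma = 4, 8, 3, 0.99  # hypothetical sizes

# Shared feature extractor (assumed here to be a single ReLU layer)
# and the online network's last linear layer.
W_feat = rng.normal(size=(obs_dim, feat_dim)) * 0.1
W_head = rng.normal(size=(feat_dim, n_actions)) * 0.1
# The "target network" is only a frozen copy of the last linear layer.
W_head_target = W_head.copy()


def features(obs):
    """Up-to-date features, shared by the online and target heads."""
    return np.maximum(obs @ W_feat, 0.0)


def q_online(obs):
    return features(obs) @ W_head


def q_target(obs):
    # Same shared features, but the frozen copy of the last layer.
    return features(obs) @ W_head_target


def td_target(reward, next_obs, done):
    """One-step Q-learning target computed with the frozen head."""
    return reward + gamma * (1.0 - done) * q_target(next_obs).max(axis=-1)


# A random batch of transitions, for illustration only.
batch = 5
obs = rng.normal(size=(batch, obs_dim))
next_obs = rng.normal(size=(batch, obs_dim))
action = rng.integers(0, n_actions, size=batch)
reward = rng.normal(size=batch)
done = np.zeros(batch)

y = td_target(reward, next_obs, done)
td_error = q_online(obs)[np.arange(batch), action] - y

# Extra memory compared to a target-free agent: one linear layer.
extra_params = W_head_target.size
```

In a full agent, the target `y` would be treated as a constant when differentiating the loss, and `W_head_target` would be refreshed from `W_head` at each target update; both details are left out of this sketch.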

Paper Structure

This paper contains 28 sections, 8 equations, 26 figures, 3 tables, 1 algorithm.

Figures (26)

  • Figure 1: We propose a simple alternative to target-based/target-free approaches, where a linear layer represents the target network, sharing the rest of the parameters with the online network (Shared Features). We apply the concept of iterated $Q$-learning (Vincent et al., 2025), which consists of learning multiple Bellman updates in parallel, to reduce the performance gap between target-free and target-based approaches (iterated Shared Features).
  • Figure 2: Comparison of the training path defined by the target networks obtained after each target update during training between the target-based approach (bottom) and the iterated Shared Features approach (top). While both approaches wait for $T$ training steps before shifting their respective window by one $Q$-function, our approach already considers the following Bellman iterations using multiple heads, where each head represents the Bellman iteration of the previous head.
  • Figure 3: Reducing the performance gap in online RL on $15$ Atari games with the CNN architecture and LayerNorm (LN). While removing the target network leads to a $10\%$ drop in AUC (left), iS-DQN with $K=9$ (using $10$ linear heads) not only closes the gap but improves over the target-based approach by $6\%$. Importantly, iS-DQN uses a comparable number of parameters to TF-DQN (right).
  • Figure 4: Left: Reducing the performance gap in online RL on $10$ Atari games with the IMPALA architecture and LayerNorm (LN). Similar to the results with the CNN architecture, iS-DQN bridges the gap between the target-free and target-based approaches. Middle and Right: Reducing the performance gap in online RL on $15$ Atari games with the CNN architecture. Removing the target network of the vanilla DQN algorithm results in a $60\%$ performance drop ($100\% - 40\%$). By using iS-DQN with $K=3$, the performance drop is divided by $4$ ($100\% - 85\% = 15\% = 60\% / 4$), thereby confirming the benefit of this approach.
  • Figure 5: Reducing the performance gap in offline RL on $10$ Atari games with the IMPALA architecture and LayerNorm (LN). iS-CQL shrinks the performance gap from $26\%$ to $6\%$. Interestingly, applying the idea of sharing parameters to Ensemble DQN (Ensemble Shared Features, ES-CQL) also reduces the performance gap, demonstrating that this idea is not limited to iterated $Q$-learning and can be applied to other target-based approaches.
  • ...and 21 more figures
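
The multi-head scheme described in the captions of Figures 2 and 3 can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's algorithm: it assumes $K+1$ linear heads on top of shared features (consistent with $K=9$ using $10$ heads), that head $k$ regresses onto the Bellman backup computed from head $k-1$, and that the window shift after $T$ training steps copies each head into its predecessor. The names `heads`, `q_all_heads`, `iterated_targets`, and `shift_window` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, n_actions, K, gamma = 8, 3, 3, 0.99  # hypothetical sizes

# K + 1 linear heads on shared features; head 0 plays the target role.
heads = rng.normal(size=(K + 1, feat_dim, n_actions)) * 0.1


def q_all_heads(feats, heads):
    """Q-values of every head: (batch, feat_dim) -> (K + 1, batch, n_actions)."""
    return np.einsum("bf,kfa->kba", feats, heads)


def iterated_targets(reward, next_feats, done, heads):
    """Row k - 1 holds the regression target for head k, built from head k - 1."""
    next_q = q_all_heads(next_feats, heads)[:-1]  # heads 0 .. K - 1
    return reward + gamma * (1.0 - done) * next_q.max(axis=-1)  # (K, batch)


def shift_window(heads):
    """Assumed shift rule: head k takes the weights of head k + 1."""
    shifted = heads.copy()
    shifted[:-1] = heads[1:]
    return shifted  # the last head keeps its weights


# Shared features of a random batch of next states, for illustration only.
batch = 5
next_feats = np.maximum(rng.normal(size=(batch, feat_dim)), 0.0)
reward = rng.normal(size=batch)
done = np.zeros(batch)

targets = iterated_targets(reward, next_feats, done, heads)
shifted = shift_window(heads)
```

Because every head is a single linear layer, adding heads grows the parameter count only slightly, which matches the caption's remark that iS-DQN uses a number of parameters comparable to TF-DQN.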