
Learning the Target Network in Function Space

Kavosh Asadi, Yao Liu, Shoham Sabach, Ming Yin, Rasool Fakoor

TL;DR

This work proposes Lookahead-Replicate (LR), a new value-function approximation algorithm that keeps the online and target networks equivalent in function space rather than in parameter space, and shows that LR converges when learning the value function.

Abstract

We focus on the task of learning the value function in the reinforcement learning (RL) setting. This task is often solved by updating a pair of online and target networks while ensuring that the parameters of these two networks are equivalent. We propose Lookahead-Replicate (LR), a new value-function approximation algorithm that is agnostic to this parameter-space equivalence. Instead, the LR algorithm is designed to maintain an equivalence between the two networks in the function space. This value-based equivalence is obtained by employing a new target-network update. We show that LR leads to a convergent behavior in learning the value function. We also present empirical results demonstrating that LR-based target-network updates significantly improve deep RL on the Atari benchmark.
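
To make the distinction between parameter-space and function-space equivalence concrete, here is a minimal sketch, not taken from the paper. It assumes a toy one-hidden-layer ReLU value network over scalar states; the function `value`, the parameter layout, and all numbers are hypothetical. Swapping the two hidden units yields a different parameter vector that computes exactly the same value function.

```python
def value(params, state):
    """Toy value network (illustrative assumption):
    v(s) = sum_j out_w[j] * relu(hidden_w[j] * s + hidden_b[j])."""
    hidden_w, hidden_b, out_w = params
    return sum(o * max(0.0, w * state + b)
               for w, b, o in zip(hidden_w, hidden_b, out_w))

# "Online" parameters theta (hypothetical numbers).
theta = ([1.0, -2.0], [0.5, 1.0], [3.0, 0.25])

# "Target" parameters w: the same two hidden units in swapped order.
w = ([-2.0, 1.0], [1.0, 0.5], [0.25, 3.0])

states = [-1.0, 0.0, 0.5, 2.0]

# Parameter-space equivalence fails: the two parameter tuples differ.
same_params = theta == w

# Function-space equivalence holds: the values agree on every probed state.
same_values = all(abs(value(theta, s) - value(w, s)) < 1e-12 for s in states)

print(same_params, same_values)
```

A target update that only asks for matching values would accept the pair above, while one that asks for matching parameters would not.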

Paper Structure

This paper contains 15 sections, 8 theorems, 66 equations, 10 figures, 3 algorithms.

Key Result

Theorem 4.3

Let $\left\{ (\theta^{t} , w^{t}) \right\}_{t \in \mathbb{N}}$ be a sequence of parameters generated by the Lookahead-Replicate algorithm. Assume $F_w>\max\{F_\theta, 7\kappa_1^2,\frac{4\kappa_1^2}{1-\zeta}\}$. Given appropriate settings of step-sizes $(\alpha,\beta)$, where $\alpha,\beta,\zeta$ exp for some $\sigma < 1$. In particular, the pair $(\theta^{\star} , w^{\star})\in \mathcal{F}_{value}$ ...

Figures (10)

  • Figure 1: A sample trial of the LR algorithm on the Markov chain. The iterations of $v_\theta$ and $v_w$ in the value space (left), and the iterations of the two parameters $\theta$ and $w$ in the parameter space (right). Notice that LR converges to a pair of points where the value functions are equivalent $v_{\theta} = v_{w}$ despite the fact that $\theta \neq w$.
  • Figure 2: A comparison between two variations of LR (blue and red) with Rainbow under the default frequency-based (black) and Polyak-based (green) updates. The first variation of LR (blue) performs the Replicate step by sampling states and actions from the replay buffer and minimizing the value difference between the target and online network. The second variation of LR (red) only samples states from the replay buffer and minimizes the value difference for all actions in each sampled state. Results are averaged over 5 random seeds.
  • Figure 3: A comparison between Rainbow and LR for different values of $K_R$, the number of gradient updates to the target network before updating the online network. The y-axis is the median human-normalized performance across the 6 games. Results are aggregated over 5 random seeds. Higher is better.
  • Figure 4: Performance of the LR agent as a function of $K_R$, the number of updates to the target network in the Replicate step. Higher is better. Notice that an intermediate value of $K_R$ performs best.
  • Figure 5: Parameter norm of the target and online Q networks in different algorithms, averaged over the 6 games in Figure \ref{fig:K800}.
  • ...and 5 more figures
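
The Replicate step described in the captions above (several gradient updates to the target network that shrink its value difference to the online network on sampled states) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes linear value functions, a mean-squared value difference, and plain gradient descent. The names `replicate`, `value_gap`, `k_r`, and `step_size`, along with the feature vectors and numbers, are hypothetical. The online-network update is not shown.

```python
def v(params, features):
    """Linear value function (an assumption made for this sketch)."""
    return sum(p * f for p, f in zip(params, features))

def value_gap(target, online, batch):
    """Mean squared value difference between the two networks on a batch."""
    return sum((v(target, x) - v(online, x)) ** 2 for x in batch) / len(batch)

def replicate(target, online, batch, k_r, step_size):
    """Take k_r gradient steps on the target parameters only,
    reducing value_gap while the online parameters stay fixed."""
    target = list(target)
    for _ in range(k_r):
        grad = [0.0] * len(target)
        for features in batch:
            diff = v(target, features) - v(online, features)
            for i, f in enumerate(features):
                grad[i] += 2.0 * diff * f / len(batch)
        target = [t - step_size * g for t, g in zip(target, grad)]
    return target

# Hypothetical state features; the third feature is the sum of the first two,
# so many different parameter vectors represent the same value function.
batch = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0]]
online = [1.0, -1.0, 0.5]
target = [0.0, 0.0, 0.0]

gap_before = value_gap(target, online, batch)
new_target = replicate(target, online, batch, k_r=200, step_size=0.1)
gap_after = value_gap(new_target, online, batch)
param_distance = max(abs(a - b) for a, b in zip(new_target, online))

print(gap_before, gap_after, param_distance)
```

In this toy setting the value gap shrinks toward zero while the target parameters stay visibly different from the online parameters, which mirrors the behavior noted for Figure 1: $v_{\theta} = v_{w}$ even though $\theta \neq w$.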

Theorems & Definitions (15)

  • Theorem 4.3
  • Corollary 4.4
  • Lemma 1.1
  • proof
  • Lemma 1.2
  • proof
  • Lemma 1.3
  • proof
  • Lemma 1.4
  • proof
  • ...and 5 more