Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation

Fengdi Che, Chenjun Xiao, Jincheng Mei, Bo Dai, Ramki Gummadi, Oscar A Ramirez, Christopher K Harris, A. Rupam Mahmood, Dale Schuurmans

TL;DR

This work tackles off-policy evaluation with offline data, where the deadly triad of off-policy data, bootstrapping, and function approximation can cause TD divergence. It introduces Over-parameterized Target TD (OTTD), which couples a target network with an over-parameterized linear representation to yield convergence to the TD fixed point under broad data-coverage conditions, and it shows this remains robust when learning from trajectories via Normalized Importance Sampling (NIS). The authors provide high-probability error bounds and demonstrate empirical stability on Baird's counterexample and a Four Room task, while also extending the results to offline Q-learning. The findings imply that fixed-point convergence becomes distribution-agnostic under over-parameterization and target networks, suggesting practical avenues for reliable offline reinforcement learning with linear function approximation and potential extensions to nonlinear architectures.

Abstract

We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird's counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.
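To make the interplay of the two ingredients concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a small deterministic ring of $k$ states in which every next state also appears in the data, random Gaussian features of dimension $d>k$, and an inner regression solved exactly with a pseudoinverse (a stand-in for gradient updates started from zero). The function name `target_td_overparam` and all numerical values are hypothetical.

```python
import numpy as np

def target_td_overparam(phi, phi_next, rewards, gamma, n_outer=500):
    """Hypothetical sketch: target-parameter TD with an exact inner fit."""
    k, d = phi.shape
    assert d > k, "sketch assumes the over-parameterized regime d > k"
    phi_pinv = np.linalg.pinv(phi)      # gives the minimum-norm least-squares fit
    theta_target = np.zeros(d)          # zero initialization
    for _ in range(n_outer):
        # Bootstrapped targets are computed from the frozen target parameters.
        targets = rewards + gamma * phi_next @ theta_target
        # Fit the targets, then copy the result into the target parameters.
        theta_target = phi_pinv @ targets
    return theta_target

rng = np.random.default_rng(0)
k, d, gamma = 5, 12, 0.9
phi = rng.normal(size=(k, d))           # assumed features; full row rank almost surely
P = np.roll(np.eye(k), 1, axis=1)       # assumed dynamics: state i -> state (i + 1) mod k
rewards = np.arange(k, dtype=float)     # assumed rewards
phi_next = P @ phi                      # every next state is covered by the data

theta = target_td_overparam(phi, phi_next, rewards, gamma)
v_true = np.linalg.solve(np.eye(k) - gamma * P, rewards)
max_error = np.max(np.abs(phi @ theta - v_true))
print(f"max value error: {max_error:.2e}")
```

In this sketch the fitted values satisfy $\Phi\theta_{t+1} = r + \gamma P \Phi\theta_t$ because the over-parameterized model interpolates the frozen targets, so the value iterates contract at rate $\gamma$ with no dependence on how the states are weighted.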

Paper Structure (28 sections, 22 theorems, 68 equations, 5 figures, 6 tables)

Key Result

Proposition 3.1

For the over-parameterized regime $d>k$, if the following two conditions hold: then there exists a learning rate $\eta$ such that the parameter of OTD updates converges to when the initial parameter of OTD equals zero.

Figures (5)

  • Figure 1: On Baird's counterexample, states are sampled from a uniform distribution, and only one action is available at each state. The discount factor is set to $\gamma=0.95$. We plot the maximal value prediction error among all states for the OTD, OTTD, RM, and GTD algorithms. For all algorithms other than OTD, the value errors converge to zero. OTTD avoids both the divergence of TD and the slow convergence rate of the other algorithms.
  • Figure 2: In this example, each state has exactly one action, and rewards are labeled next to the transitions. The value functions are parameterized by a scalar parameter $\theta$, and the features are shown in the graph. This counterexample demonstrates a task with fixed transition probabilities and rewards where our convergence condition is satisfied. However, the conditions for under-parameterized target TD fail for certain data distributions.
  • Figure 3: On Four Room, data are sampled as trajectories under the random policy, while the target policy is given by a human player. The left sub-figure shows the training error, EMSBE. The middle and right sub-figures show the infinity norm of the value errors, that is, $\lVert \Phi \theta_{\mathrm{TD}}^*-q_{\pi} \rVert_{\infty}$. The right sub-figure uses a larger dataset, and all results are averaged over $10$ random seeds. With off-policy data, per-step normalized IS can correct the action distribution and behaves similarly to sampling actions from the target policy. Owing to its lower variance, normalized IS avoids the divergence seen with IS correction.
  • Figure 4: The features are shown in the figure. Transitions are indicated by arrows. This Baird's counterexample is a Markov reward process, and only one action is available at each state.
  • Figure 5: Black blocks are walls that cannot be passed through, green blocks are hallways, and the purple block is the terminal state with $+1$ reward. Each state has an $(x,y)$ coordinate, and the actions are up, down, left, and right.
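The normalized IS correction mentioned in the Figure 3 caption can be illustrated with a short sketch. It assumes one common form of normalization, in which per-step ratios are divided by their sum over a batch so that the weights sum to one; the paper's exact NIS definition may differ. The function name `normalized_is_weights` and the probabilities below are hypothetical.

```python
import numpy as np

def normalized_is_weights(pi_probs, mu_probs):
    """Hypothetical helper: per-step ratios pi/mu, normalized to sum to one."""
    ratios = np.asarray(pi_probs, dtype=float) / np.asarray(mu_probs, dtype=float)
    return ratios / ratios.sum()

# Assumed example: a uniform random behavior policy over four actions,
# and a target policy that strongly prefers the actions taken at steps 0 and 2.
pi_probs = np.array([0.9, 0.1, 0.9, 0.1])       # target-policy probabilities of taken actions
mu_probs = np.array([0.25, 0.25, 0.25, 0.25])   # behavior-policy probabilities of taken actions

weights = normalized_is_weights(pi_probs, mu_probs)
print(weights)
```

Because the weights are bounded by one after normalization, a single large ratio cannot dominate an update the way it can with unnormalized IS, which matches the lower-variance behavior described in the caption.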

Theorems & Definitions (34)

  • Proposition 3.1
  • Theorem 3.2
  • Remark 3.3
  • Proposition 3.4
  • Theorem 3.5
  • Corollary 3.6
  • Proposition 4.1
  • Theorem 4.2
  • Corollary 4.3
  • Theorem 5.1
  • ...and 24 more