A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning

Zechen Wu, Amy Greenwald, Ronald Parr

TL;DR

This work provides a unifying matrix-splitting framework that treats TD, FQI, and PFQI as the same preconditioned iterative method solving a common target linear system in linear function approximation for off-policy policy evaluation. By interpreting target networks as transitions between preconditioners, the authors connect convergence properties across methods, showing rank invariance and consistency are the core determinants of convergence rather than the number of updates alone. They introduce an encoder-decoder perspective on the target system, explain stability under various feature assumptions, and establish sharp convergence conditions, including when over-parameterization helps or hurts. The results clarify longstanding questions about when TD and FQI converge, explain PFQI’s transitional behavior, and provide new insights into feature requirements and stability in OPE with linear function approximation. This framework lays groundwork for sharper theoretical guarantees and could inform the design of new, more robust off-policy algorithms.

Abstract

In off-policy policy evaluation (OPE) tasks within reinforcement learning, Temporal Difference Learning (TD) and Fitted Q-Iteration (FQI) have traditionally been viewed as differing in the number of updates toward the target value function: TD makes one update, FQI makes an infinite number, and Partial Fitted Q-Iteration (PFQI) performs a finite number. We show that this view is not accurate, and provide a new mathematical perspective under linear value function approximation that unifies these methods as a single iterative method solving the same linear system, but using different matrix splitting schemes and preconditioners. We show that increasing the number of updates under the same target value function, i.e., the target network technique, is a transition from using a constant preconditioner to using a data-feature adaptive preconditioner. This elucidates, for the first time, why TD convergence does not necessarily imply FQI convergence, and establishes tight convergence connections among TD, PFQI, and FQI. Our framework enables sharper theoretical results than previous work and a characterization of the convergence conditions for each algorithm, without relying on assumptions about the features (e.g., linear independence). We also provide an encoder-decoder perspective to better understand the convergence conditions of TD, and prove, for the first time, that when a large learning rate doesn't work, trying a smaller one may help. Our framework also leads to the discovery of new crucial conditions on features for convergence, and shows how common assumptions about features influence convergence, e.g., the assumption of linearly independent features can be dropped without compromising the convergence guarantees of stochastic TD in the on-policy setting. This paper is also the first to introduce matrix splitting into the convergence analysis of these algorithms.
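The unification described in the abstract can be illustrated with a small numerical sketch. It is not the authors' code. It assumes the standard definitions $\Sigma_{cov}=\Phi^\top D\Phi$, $\Sigma_{cr}=\Phi^\top D P_\pi\Phi$ and $\theta_{\phi,r}=\Phi^\top D R$ (this page uses the symbols $\Sigma_{cov}$ and $\Sigma_{cr}$ but does not spell them out), and it reads the "constant" preconditioner as $\alpha I$ (TD) and the "data-feature adaptive" one as $\Sigma_{cov}^{\dagger}$ (FQI), with PFQI as the truncated sum $\alpha\sum_{i=0}^{t-1}(I-\alpha\Sigma_{cov})^i$ obtained by unrolling $t$ gradient steps toward a fixed target. The 3-state chain, the features, and all function names are hypothetical.

```python
import numpy as np

def preconditioned_iteration(M, A, b, theta0, num_iters):
    """Run theta <- theta - M (A theta - b) for a fixed preconditioner M."""
    theta = theta0.copy()
    for _ in range(num_iters):
        theta = theta - M @ (A @ theta - b)
    return theta

def td_preconditioner(dim, alpha):
    """Constant preconditioner: alpha * I (assumed reading of TD)."""
    return alpha * np.eye(dim)

def fqi_preconditioner(sigma_cov):
    """Data-adaptive preconditioner: pseudoinverse of Sigma_cov (assumed reading of FQI)."""
    return np.linalg.pinv(sigma_cov)

def pfqi_preconditioner(sigma_cov, alpha, t):
    """alpha * sum_{i<t} (I - alpha Sigma_cov)^i, from unrolling t inner gradient steps."""
    dim = sigma_cov.shape[0]
    step = np.eye(dim) - alpha * sigma_cov
    M = np.zeros((dim, dim))
    power = np.eye(dim)
    for _ in range(t):
        M += alpha * power
        power = power @ step
    return M

# Hypothetical toy problem: symmetric transition matrix, so the uniform
# distribution is stationary and sampling from it is on-policy.
P = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
D = np.eye(3) / 3.0                      # state weighting (uniform)
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])             # full-column-rank features
R = np.array([1.0, 0.0, 2.0])
gamma, alpha = 0.9, 0.5

sigma_cov = Phi.T @ D @ Phi
sigma_cr = Phi.T @ D @ P @ Phi
A = sigma_cov - gamma * sigma_cr         # shared target linear system A theta = b
b = Phi.T @ D @ R

theta0 = np.zeros(2)
theta_td = preconditioned_iteration(td_preconditioner(2, alpha), A, b, theta0, 2000)
theta_fqi = preconditioned_iteration(fqi_preconditioner(sigma_cov), A, b, theta0, 500)

print("TD solution: ", theta_td)
print("FQI solution:", theta_fqi)
```

In this sketch the three methods differ only in the matrix `M` passed to the same loop. With `t = 1` the PFQI preconditioner reduces to the TD one, and for large `t` (with invertible `sigma_cov` and a small enough `alpha`) it approaches the FQI one, which is one way to picture the transition the abstract describes. The toy problem is on-policy, where both iterations converge; it says nothing about the off-policy cases the paper analyzes.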

Paper Structure (171 sections, 102 theorems, 417 equations)

Key Result

Proposition 3.1

(1) $\Theta_{\text{LSTD}}\supseteq\Theta_{\text{FQI}}$. (2) $\Theta_{\text{LSTD}}=\Theta_{\text{FQI}}$ if and only if $\operatorname{Rank}\left(\Sigma_{cov} - \gamma\Sigma_{cr}\right)=\operatorname{Rank}\left(I-\gamma\Sigma_{cov}^{\dagger}\Sigma_{cr}\right)$. (3) If $\Phi$ is full column rank, …

Theorems & Definitions (184)

  • Proposition 3.1
  • Proposition 4.2: Universal Consistency
  • Proposition 4.5
  • Proposition 4.6
  • Proposition 4.7
  • Proposition 4.9
  • Theorem 5.1
  • Lemma 5.2
  • Corollary 5.3
  • Theorem 6.1
  • ...and 174 more