
Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

Brett Daley, Martha White, Christopher Amato, Marlos C. Machado

TL;DR

This work extends off-policy reinforcement learning theory from per-decision eligibility traces to trajectory-aware credit assignment, introducing a unifying operator $\mathcal{M}$ that can express both traditional and trajectory-aware methods. The authors prove convergence guarantees in the tabular setting for policy evaluation and control under a key condition on the tracing coefficients, and they analyze when existing trajectory-aware methods may diverge. A practical instantiation, Recency-Bounded Importance Sampling (RBIS), is proposed and shown to perform well across $\lambda$ values in off-policy control, with theoretical and empirical support. Overall, the paper provides a principled framework for designing trajectory-aware off-policy corrections and clarifies the trade-offs between trace preservation and variance, with implications for future work in function approximation and deep RL.

Abstract

Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $\lambda$-values in an off-policy control task.
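The contrast between per-decision and trajectory-aware trace coefficients can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-decision rule is the Retrace-style cut $c_t = \lambda \min(1, \rho_t)$, and the trajectory-aware rule $\beta_t = \min(\lambda^t, \beta_{t-1} \rho_t)$ is a hypothetical recency-bounded form chosen for the example, since the abstract does not state the RBIS formula. The function names and the ratio sequence are likewise made up.

```python
def per_decision_traces(rhos, lam):
    """Cumulative trace under a Retrace-style per-decision cut.

    Each step multiplies the running trace by lam * min(1, rho_t), so the
    trace can only shrink; a cut made at one step is never undone later.
    """
    beta, out = 1.0, []
    for rho in rhos:
        beta *= lam * min(1.0, rho)
        out.append(beta)
    return out


def trajectory_aware_traces(rhos, lam):
    """Cumulative trace under an assumed recency-bounded rule.

    beta_t = min(lam**t, beta_{t-1} * rho_t): the running product of IS
    ratios is allowed to grow back, but never above the lam**t envelope.
    (Hypothetical form for illustration, not taken from the source.)
    """
    beta, out = 1.0, []
    for t, rho in enumerate(rhos, start=1):
        beta = min(lam ** t, beta * rho)
        out.append(beta)
    return out


# Made-up IS ratios: one unlikely action followed by two very likely ones.
rhos = [0.1, 5.0, 5.0, 1.0]
lam = 0.9

pd = per_decision_traces(rhos, lam)
ta = trajectory_aware_traces(rhos, lam)

for t, (b_pd, b_ta) in enumerate(zip(pd, ta), start=1):
    print(f"t={t}  per-decision={b_pd:.4f}  trajectory-aware={b_ta:.4f}")
```

In this sketch the per-decision trace stays small after the first low ratio, while the trajectory-aware trace climbs back to the $\lambda^t$ envelope once later ratios compensate. A ratio of exactly zero would still zero out both traces permanently.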

Paper Structure

This paper contains 21 sections, 9 theorems, 50 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Theorem 5.2

If the paper's convergence condition holds, then $\mathcal{M}$ is a contraction mapping with $Q^\pi$ as its unique fixed point. Consequently, $\lim_{i \to \infty} \mathcal{M}^i Q = Q^\pi$ for all $Q \in \mathbb{R}^n$.
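What a contraction with a unique fixed point implies in practice can be checked numerically. The sketch below does not implement $\mathcal{M}$, which this summary does not define. It uses the one-step expected Bellman operator for a fixed target policy as a stand-in, on a made-up 2-state, 2-action MDP. All names and numbers are hypothetical.

```python
GAMMA = 0.9

# Hypothetical MDP: P[s][a] is a list of (next_state, probability) pairs,
# R[s][a] is the expected immediate reward.
P = {
    0: {0: [(0, 0.8), (1, 0.2)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 0.5), (0, 0.5)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}
PI = {0: [0.5, 0.5], 1: [0.9, 0.1]}  # target policy pi(a | s)


def bellman(Q):
    """Apply the one-step expected Bellman operator for PI to a tabular Q."""
    new_Q = {}
    for s in P:
        for a in P[s]:
            expected_next = sum(
                prob * sum(PI[s2][a2] * Q[(s2, a2)] for a2 in P[s2])
                for s2, prob in P[s][a]
            )
            new_Q[(s, a)] = R[s][a] + GAMMA * expected_next
    return new_Q


def sup_dist(Q1, Q2):
    """Maximum-norm distance between two tabular Q-functions."""
    return max(abs(Q1[k] - Q2[k]) for k in Q1)


# Two very different starting points.
Q_a = {(s, a): 0.0 for s in P for a in P[s]}
Q_b = {(s, a): 100.0 for s in P for a in P[s]}

gaps = []
for _ in range(200):
    gaps.append(sup_dist(Q_a, Q_b))
    Q_a, Q_b = bellman(Q_a), bellman(Q_b)

print("final gap between iterates:", sup_dist(Q_a, Q_b))
print("fixed-point residual:", sup_dist(bellman(Q_a), Q_a))
```

The gap between the two iterates shrinks by at least a factor of `GAMMA` per application, so both sequences approach the same limit regardless of initialization. This is the behavior the theorem asserts for $\mathcal{M}$ under its condition.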

Figures (4)

  • Figure 1:
  • Figure 2: The Bifurcated Gridworld environment. The choice made at B greatly impacts the discounted return ultimately earned. We plot the AUC obtained by four off-policy methods across the $\lambda$-spectrum. The dashed horizontal lines mark the highest AUC achieved by each method.
  • Figure 3: Learning curves for the $\lambda$-values we tested in the Bifurcated Gridworld environment. The dashed black line indicates the optimal discounted return for this problem.
  • Figure 4: $\lambda$-sweeps conducted on three additional gridworld topologies. The experiment procedure was identical to that used for the $\lambda$-sweep in Figure 2.

Theorems & Definitions (18)

  • Theorem 5.2
  • proof
  • Theorem 5.3
  • proof (sketch; full proof in the appendix)
  • Proposition 5.3
  • proof
  • Proposition 5.3
  • proof
  • Proposition 5.3
  • proof
  • ...and 8 more