Demystifying the Recency Heuristic in Temporal-Difference Learning

Brett Daley; Marlos C. Machado; Martha White

TL;DR

This paper analyzes the recency heuristic used in temporal-difference (TD) learning, formalizing it through the forward and backward views of TD and proving that convex returns guarantee contraction and convergence, whereas violating the heuristic can lead to divergence in on-policy, tabular settings. It shows that the weak recency heuristic is equivalent to expressing return targets as convex combinations of $n$-step returns, which unifies the forward and backward perspectives and explains the robustness of TD($\lambda$). The authors provide a counterexample demonstrating divergence under non-recent credit assignment and extend the theory to off-policy trajectory-aware traces and function approximation, highlighting risks and requirements for new credit-assignment schemes. Overall, the results justify the empirical success of recency-based TD methods and offer a theoretical framework for designing credit-assignment strategies beyond the standard recency assumption.

Abstract

The recency heuristic in reinforcement learning is the assumption that stimuli that occurred closer in time to an acquired reward should be more heavily reinforced. The recency heuristic is one of the key assumptions made by TD($\lambda$), which reinforces recent experiences according to an exponentially decaying weighting. In fact, all other widely used return estimators for TD learning, such as $n$-step returns, satisfy a weaker (i.e., non-monotonic) recency heuristic. Why is the recency heuristic effective for temporal credit assignment? What happens when credit is assigned in a way that violates this heuristic? In this paper, we analyze the specific mathematical implications of adopting the recency heuristic in TD learning. We prove that any return estimator satisfying this heuristic: 1) is guaranteed to converge to the correct value function, 2) has a relatively fast contraction rate, and 3) has a long window of effective credit assignment, yet bounded worst-case variance. We also give a counterexample where on-policy, tabular TD methods violating the recency heuristic diverge. Our results offer some of the first theoretical evidence that credit assignment based on the recency heuristic facilitates learning.
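The exponentially decaying weighting of TD($\lambda$) can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the toy trajectory, and the episodic convention that the terminal state has value zero are all assumptions made for illustration. The sketch computes the $\lambda$-return in two ways, once as a weighted mixture of $n$-step returns (forward view) and once as the current value estimate plus TD errors weighted by $(\gamma\lambda)^i$ (the quantity that eligibility traces accumulate in the backward view), so the two can be compared on a fixed trajectory.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t, bootstrapping from values[t + n].

    `values` has one more entry than `rewards`; the last entry is the
    terminal state's value (assumed to be 0 here).
    """
    end = min(t + n, len(rewards))
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]


def lambda_return_forward(rewards, values, t, gamma, lam):
    """Lambda-return as a mixture of n-step returns (forward view)."""
    horizon = len(rewards) - t  # steps remaining until termination
    g = 0.0
    for n in range(1, horizon):
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # All n >= horizon yield the full return; their total weight is lam^(horizon-1).
    g += lam ** (horizon - 1) * n_step_return(rewards, values, t, horizon, gamma)
    return g


def lambda_return_td_errors(rewards, values, t, gamma, lam):
    """Lambda-return as V(s_t) plus TD errors weighted by (gamma * lam)^i."""
    g = values[t]
    for i in range(len(rewards) - t):
        delta = rewards[t + i] + gamma * values[t + i + 1] - values[t + i]
        g += (gamma * lam) ** i * delta
    return g


# Hypothetical 4-step episode and arbitrary value estimates (terminal value 0).
rewards = [1.0, 0.0, -2.0, 0.5]
values = [0.3, -0.1, 0.7, 0.2, 0.0]
forward = lambda_return_forward(rewards, values, t=0, gamma=0.9, lam=0.8)
td_sum = lambda_return_td_errors(rewards, values, t=0, gamma=0.9, lam=0.8)
print(forward, td_sum)  # the two values agree up to floating-point error
```

With `lam=0` both functions reduce to the one-step TD target, and with `lam=1` both reduce to the Monte Carlo return, which is why the decay rate controls how far back credit reaches.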

Paper Structure (19 sections, 6 theorems, 29 equations, 8 figures, 1 table)

This paper contains 19 sections, 6 theorems, 29 equations, 8 figures, 1 table.

Introduction
Background
TD($\lambda$) and the Recency Heuristic
$n$-step Returns and Compound Returns
Value-Function Operators and Convergence Conditions
Formalizing the Recency Heuristic
What Happens When the Recency Heuristic Is Violated?
Only Convex Returns Satisfy the Weak Recency Heuristic
Are Monotonically Decreasing Weights Necessary?
Off-Policy Learning and Other Extensions
Conclusion
Proofs
Proof of \ref{prop:sample-real_op}
Proof of \ref{prop:wrh}
Proof of \ref{prop:variance}
...and 4 more sections

Key Result

Proposition 5.1

For every sample-realizable operator ${\bm{H}}$ whose fixed point is ${\bm{v}}_\pi$, there exists a sequence of real numbers $(h_i)_{i=0}^\infty$ such that [display equation missing from source]. If we let $c_n \stackrel{\text{\tiny def}}{=} h_{n-1} - h_n$ for $n \geq 1$, then ${\bm{H}}$ also has the equivalent form [display equation missing from source].
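The change of variables $c_n = h_{n-1} - h_n$ can be explored numerically with a small sketch. It rests on assumptions not stated in this excerpt: that $h_i$ is the weight placed on the $i$-th future TD error, that $c_n$ is the weight on the $n$-step return, and that "convex" means the $c_n$ are non-negative and sum to one. The helper names and the three example sequences, including the delayed pulse, are hypothetical and use truncated (finite) weight sequences.

```python
def n_step_weights(h):
    """Return c_n = h[n-1] - h[n] for n = 1, ..., len(h) - 1."""
    return [h[n - 1] - h[n] for n in range(1, len(h))]


def is_convex(c, tol=1e-9):
    """True if the weights are non-negative and sum to 1 (within tol)."""
    return all(w >= -tol for w in c) and abs(sum(c) - 1.0) <= tol


# lambda-return with lambda = 0.5: h_i = 0.5^i, truncated after 60 steps,
# which gives c_n = (1 - 0.5) * 0.5^(n-1).
lambda_h = [0.5 ** i for i in range(61)]

# 3-step return: full weight on the first three TD errors, none afterwards.
three_step_h = [1, 1, 1, 0, 0]

# Hypothetical delayed pulse: all credit placed on the TD error one step late.
delayed_h = [0, 1, 0, 0]

for name, h in [("lambda", lambda_h), ("3-step", three_step_h), ("delayed", delayed_h)]:
    c = n_step_weights(h)
    print(name, "convex:", is_convex(c))
```

Under these assumptions, a non-increasing sequence that starts at 1 and decays to 0 yields non-negative $c_n$ that sum to one, whereas the delayed pulse produces a negative weight on the 1-step return and so is not a convex combination.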

Figures (8)

Figure 1: Illustrations of the eligibility curves for (a) $\lambda$-return, (b) $n$-step return, (c) inverted U-shape assignment inspired by Klopf (1972), and (d) time-delayed $\lambda$-return. The horizontal axis represents the elapsed time since the stimulus. Neither (c) nor (d) satisfies the recency heuristic.
Figure 2: (Left) MRP for \ref{counterexample:pulse}; rewards are zero. (Center) Credit-assignment function for delayed TD(0). (Right) Expected update directions of \ref{eq:delayed_pulse} for $\tau=1$, $\gamma=0.9$, $p=0.4$.
Figure 3: Hierarchical relationship between different return estimators. A return satisfies the weak recency heuristic if and only if it is a convex return: i.e., a compound or $n$-step return.
Figure 4: The 19-state random walk (Sutton & Barto, 2018).
Figure 5: Impulse responses of $\lambda$-returns with varying degrees of sparsity.
...and 3 more figures

Theorems & Definitions (15)

Definition 3.1: Weak Recency Heuristic
Definition 3.2: Strong Recency Heuristic
Definition 5.1
Proposition 5.1
proof
Proposition 5.2
proof
Proposition 6.1
proof
Proposition A.1
...and 5 more
