Quantifying Memory Use in Reinforcement Learning with Temporal Range

Rodney Lafuente-Mercado, Daniela Rus, T. Konstantin Rusch

TL;DR

This work introduces Temporal Range (TR), a model-agnostic metric that quantifies how far back a trained RL policy effectively looks by aggregating first-order temporal Jacobian norms into a magnitude-weighted average lag. It provides an axiomatic foundation for vector-output linear maps that yields unique unnormalized and normalized forms, and applies these ideas to nonlinear policies via local linearization; TR is invariant to uniform input and output rescalings. Empirically, TR is validated across POPGym diagnostics and control tasks on architectures including LEM, GRU, LSTM, and LinOSS: it is small for fully observed control, larger for memory-demanding tasks such as Copy-$k$, and aligned with the minimal history window needed for near-optimal return. A compact LEM proxy enables TR computation when gradients are inaccessible and supports memory-efficient deployment by guiding the choice of minimal sufficient context. Overall, TR offers a practical, per-sequence readout of memory dependence that can inform architecture design, environment analysis, and resource-efficient deployment in reinforcement learning.

Abstract

How much does a trained RL policy actually use its past observations? We propose Temporal Range, a model-agnostic metric that treats first-order sensitivities of multiple vector outputs across a temporal window to the input sequence as a temporal influence profile and summarizes it by the magnitude-weighted average lag. Temporal Range is computed via reverse-mode automatic differentiation from the Jacobian blocks $\partial y_s/\partial x_t\in\mathbb{R}^{c\times d}$ averaged over final timesteps $s\in\{t+1,\dots,T\}$ and is well-characterized in the linear setting by a small set of natural axioms. Across diagnostic and control tasks (POPGym; flicker/occlusion; Copy-$k$) and architectures (MLPs, RNNs, SSMs), Temporal Range (i) remains small in fully observed control, (ii) scales with the task's ground-truth lag in Copy-$k$, and (iii) aligns with the minimum history window required for near-optimal return as confirmed by window ablations. We also report Temporal Range for a compact Long Expressive Memory (LEM) policy trained on the task, using it as a proxy readout of task-level memory. Our axiomatic treatment draws on recent work on range measures, specialized here to temporal lag and extended to vector-valued outputs in the RL setting. Temporal Range thus offers a practical per-sequence readout of memory dependence for comparing agents and environments and for selecting the shortest sufficient context.
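The "magnitude-weighted average lag" described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the per-timestep Jacobian blocks have already been obtained (for example by automatic differentiation), uses the Frobenius norm as the matrix norm, and counts lag as `T - 1 - t` so that the most recent input has lag 0. The function names `frobenius_norm` and `temporal_range` are hypothetical.

```python
import math


def frobenius_norm(block):
    """Frobenius norm of a c x d matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in block for v in row))


def temporal_range(jacobian_blocks, normalized=True):
    """Magnitude-weighted lag of a temporal influence profile.

    jacobian_blocks[t] is the (assumed precomputed) c x d Jacobian block for
    the input at step t, with the last entry being the most recent input.
    Lag convention (an assumption of this sketch): lag = T - 1 - t.
    """
    T = len(jacobian_blocks)
    weights = [frobenius_norm(b) for b in jacobian_blocks]
    weighted_lag = sum(w * (T - 1 - t) for t, w in enumerate(weights))
    if not normalized:
        # Unnormalized form: total influence weighted by lag.
        return weighted_lag
    total = sum(weights)
    if total == 0.0:
        # No measurable dependence on any input; report zero range.
        return 0.0
    # Normalized form: average lag under the influence profile.
    return weighted_lag / total


# A profile whose only nonzero block sits three steps back (Copy-like).
copy_like = [[[0.0]], [[2.0]], [[0.0]], [[0.0]], [[0.0]]]
print(temporal_range(copy_like))                    # normalized
print(temporal_range(copy_like, normalized=False))  # unnormalized
```

Because the normalized form divides by the total weight, multiplying every block by the same constant leaves it unchanged, which matches the uniform-rescaling invariance stated in the TL;DR.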

Paper Structure

This paper contains 46 sections, 2 theorems, 19 equations, 5 figures, 7 tables.

Key Result

Proposition A.1

Fix any matrix norm $\|\!\cdot\!\|_{\text{mat}}$ on $\mathbb{R}^{c\times d}$. There is a unique nonnegative map $\rho_T$ on linear maps obeying R1-u through R3, namely

Figures (5)

  • Figure 1: Window ablations. Normalized return vs. context window size $m$ across architectures. Dotted vertical lines show $\hat{\rho}_T$ values. Performance recovers when $m$ exceeds temporal range, confirming TR identifies minimum sufficient context. Note that $\hat{\rho}_T$ often aligns remarkably well with task requirements (e.g., GRU's $\hat{\rho}_T \approx 12$ for Copy $k=10$). When TR appears to fall short of the empirical peak (e.g., $\hat{\rho}_T=12$ while peak occurs at $m=16$), this is typically an artifact of our sparse window sampling ($m \in \{1,2,4,8,16,32\}$); the true performance peak likely lies between tested values, closer to the TR prediction.
  • Figure 2: Temporal influence profiles (LEM). Jacobian magnitude vs. steps back from current timestep. Profiles show how policy depends on observation history: CartPole concentrates on recent steps, Stateless CartPole distributes broadly, Copy-$k$ peaks near required lookback distance.
  • Figure 3: Temporal influence profiles for Copy $k=3$ across all architectures (LEM, GRU, LSTM, LinOSS).
  • Figure 4: Temporal influence profiles for Copy $k=10$ across all architectures (LEM, GRU, LSTM, LinOSS).
  • Figure 5: Temporal influence profiles for Noisy Stateless CartPole across all architectures (LEM, GRU, LSTM, LinOSS).
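
The Figure 1 caption relates the estimated range $\hat{\rho}_T$ to a sparse grid of tested window sizes. As a rough sketch of that selection step, the helper below picks the smallest candidate window that is at least as large as a given range estimate. The function name `smallest_sufficient_window` and the use of `>=` as the sufficiency criterion are assumptions of this sketch, not a procedure taken from the paper.

```python
def smallest_sufficient_window(range_estimate, candidate_windows):
    """Return the smallest candidate window m with m >= range_estimate.

    Returns None when no candidate is large enough.
    """
    for m in sorted(candidate_windows):
        if m >= range_estimate:
            return m
    return None


# Window grid from the Figure 1 caption and the GRU estimate for Copy k=10.
windows = [1, 2, 4, 8, 16, 32]
print(smallest_sufficient_window(12, windows))  # 16
```

With the grid $m \in \{1,2,4,8,16,32\}$ and $\hat{\rho}_T = 12$, the first covering window is $m=16$, which is consistent with the caption's remark that the empirical peak can appear at the next tested size above the range estimate.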

Theorems & Definitions (5)

  • Definition 3.1: Temporal range
  • Proposition A.1: Uniqueness of the unnormalized form for vector outputs
  • Proof of Proposition A.1
  • Proposition A.2: Uniqueness of the normalized form for vector outputs
  • Proof of Proposition A.2