Algorithm-Relative Trajectory Valuation in Policy Gradient Control

Shihao Li, Jiachen Li, Jiamin Xu, Christopher Martin, Wei Li, Dongmei Chen

TL;DR

The paper examines how the value of a trajectory in policy-gradient control depends on the learning algorithm rather than on the data alone, focusing on an uncertain LQR and using Trajectory Shapley for attribution. It reports a robust negative correlation between Persistence of Excitation (PE) and trajectory value under vanilla REINFORCE, explained by a variance-mediated mechanism: for fixed energy, high PE lowers gradient variance, while near saddles higher variance raises the probability of escape, so low-PE trajectories receive higher marginal value. Stabilization via state whitening or natural gradient (Fisher preconditioning) neutralizes this variance channel and flips the correlation to positive, demonstrating that data valuation is algorithm-relative. The work offers practical data-curation guidance (Leave-One-Out for pruning, Shapley for identifying toxic subsets) and highlights the need for algorithm-aware valuation frameworks in RL, with implications for active data collection and safety-critical control.

Abstract

We study how trajectory value depends on the learning algorithm in policy-gradient control. Using Trajectory Shapley in an uncertain LQR, we find a negative correlation between Persistence of Excitation (PE) and marginal value under vanilla REINFORCE ($r\approx-0.38$). We prove a variance-mediated mechanism: (i) for fixed energy, higher PE yields lower gradient variance; (ii) near saddles, higher variance increases escape probability, raising marginal contribution. When stabilized (state whitening or Fisher preconditioning), this variance channel is neutralized and information content dominates, flipping the correlation positive ($r\approx+0.29$). Hence, trajectory value is algorithm-relative. Experiments validate the mechanism and show decision-aligned scores (Leave-One-Out) complement Shapley for pruning, while Shapley identifies toxic subsets.
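
The two valuation scores named in the abstract can be illustrated with a minimal sketch. It assumes a generic utility function over subsets of trajectory indices; in the paper's setting that utility would be the performance reached by training on the subset, which is not reproduced here. The names `loo_values`, `shapley_values`, `toy_utility`, and `WEIGHTS` are hypothetical, and the additive toy utility is chosen only because its Shapley and Leave-One-Out values are known in closed form.

```python
import random
from typing import Callable, FrozenSet, List

Utility = Callable[[FrozenSet[int]], float]


def loo_values(n: int, utility: Utility) -> List[float]:
    """Leave-One-Out: drop in utility when item i is removed from the full set."""
    full = frozenset(range(n))
    u_full = utility(full)
    return [u_full - utility(full - {i}) for i in range(n)]


def shapley_values(n: int, utility: Utility,
                   num_perms: int = 200, seed: int = 0) -> List[float]:
    """Monte Carlo permutation estimate of Shapley values.

    For each random ordering, item i is credited with the utility gain
    observed when it joins the items that precede it in that ordering.
    """
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(num_perms):
        order = list(range(n))
        rng.shuffle(order)
        coalition: FrozenSet[int] = frozenset()
        prev = utility(coalition)
        for i in order:
            coalition = coalition | {i}
            cur = utility(coalition)
            totals[i] += cur - prev
            prev = cur
    return [t / num_perms for t in totals]


# Hypothetical additive utility (an assumption for illustration only):
# each "trajectory" contributes a fixed amount, one of them negative.
WEIGHTS = [0.5, -0.2, 1.0, 0.3]


def toy_utility(subset: FrozenSet[int]) -> float:
    return sum(WEIGHTS[i] for i in subset)


if __name__ == "__main__":
    print("Shapley:", shapley_values(len(WEIGHTS), toy_utility))
    print("LOO:    ", loo_values(len(WEIGHTS), toy_utility))
```

With an additive utility the two scores coincide with the individual weights. They can differ once the utility depends on interactions between trajectories, as it does when the utility is the outcome of a training run.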

Paper Structure

This paper contains 57 sections, 7 theorems, 41 equations, 2 figures, 2 tables.

Key Result

theorem 1

Under assumptions (A1)-(A3), there exists $C > 0$ for which a bound on the gradient variance holds; thus, for fixed energy, gradient variance is monotonically decreasing in PE.
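
The phrase "for fixed energy" can be made concrete with a small sketch. It assumes, for illustration only, that the PE level of a 2-D state trajectory is measured by the smallest eigenvalue of its empirical second-moment matrix and that energy is the trace of that matrix; the paper's exact definitions are not given in this summary. The function names and the two example trajectories are hypothetical.

```python
import math
from typing import List, Tuple

State = Tuple[float, float]


def second_moment(states: List[State]) -> Tuple[float, float, float]:
    """Entries (a, b, d) of the matrix [[a, b], [b, d]] = (1/T) * sum_t x_t x_t^T."""
    T = len(states)
    a = sum(x * x for x, _ in states) / T
    b = sum(x * y for x, y in states) / T
    d = sum(y * y for _, y in states) / T
    return a, b, d


def energy(states: List[State]) -> float:
    """Average squared norm of the states (trace of the second-moment matrix)."""
    a, _, d = second_moment(states)
    return a + d


def pe_level(states: List[State]) -> float:
    """Smallest eigenvalue of the symmetric 2x2 second-moment matrix (closed form)."""
    a, b, d = second_moment(states)
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius


# Two hypothetical trajectories with the same energy but different PE:
# a balanced circle and a narrow ellipse concentrated along one axis.
T = 100
angles = [2.0 * math.pi * t / T for t in range(T)]
balanced = [(math.cos(th), math.sin(th)) for th in angles]
narrow = [(math.sqrt(1.8) * math.cos(th), math.sqrt(0.2) * math.sin(th))
          for th in angles]

if __name__ == "__main__":
    for name, traj in [("balanced", balanced), ("narrow", narrow)]:
        print(name, "energy:", energy(traj), "PE:", pe_level(traj))
```

Both trajectories carry the same energy, yet the narrow one has a much smaller minimum eigenvalue, which is the sense in which PE can vary while energy is held fixed.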

Figures (2)

  • Figure 1: Illustration of the variance-mediated mechanism. Left: State-space trajectories with different PE levels. The low-PE trajectory (top) exhibits concentrated, nearly linear excitation along a primary direction, while the high-PE trajectory (bottom) shows more balanced, circular exploration across both dimensions (green ellipses indicate covariance structure). Middle: Scatter plots of gradient components $\nabla_{K_{ij}} J$, illustrating that low-PE trajectories produce high gradient variance (top, more dispersed) while high-PE trajectories yield low gradient variance (bottom, more concentrated). Right: Learning curves showing final performance. High gradient variance from low-PE data leads to high Shapley value (top, better final performance contribution), while low variance from high-PE data results in low Shapley value (bottom). The mechanism: Low PE → High Variance → High Shapley Value; High PE → Low Variance → Low Shapley Value.
  • Figure 2: PE-Shapley correlation under vanilla REINFORCE (left, $r = -0.380$) and state whitening (right, $r = +0.294$). State whitening stabilizes gradient variance across trajectories, neutralizing the variance-mediated mechanism and flipping the correlation from negative to positive, demonstrating algorithm-relativity (Theorem 3).
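
State whitening, as referenced in Figure 2, can be sketched for 2-D states as follows. This is a minimal illustration under the assumption that whitening means centering the states and applying the inverse Cholesky factor of their empirical covariance; the paper may use a different whitening transform. The names `covariance_2d`, `whiten_2d`, and the `raw` example data are hypothetical.

```python
import math
from typing import List, Tuple

State = Tuple[float, float]


def covariance_2d(states: List[State]) -> Tuple[float, float, float]:
    """Entries (a, b, d) of the empirical covariance [[a, b], [b, d]]."""
    n = len(states)
    mx = sum(x for x, _ in states) / n
    my = sum(y for _, y in states) / n
    a = sum((x - mx) ** 2 for x, _ in states) / n
    b = sum((x - mx) * (y - my) for x, y in states) / n
    d = sum((y - my) ** 2 for _, y in states) / n
    return a, b, d


def whiten_2d(states: List[State]) -> List[State]:
    """Center the states and map them so their covariance becomes the identity.

    Uses the Cholesky factor L of the covariance (cov = L L^T) and solves
    L z = x - mean by forward substitution. Requires a non-degenerate covariance.
    """
    n = len(states)
    mx = sum(x for x, _ in states) / n
    my = sum(y for _, y in states) / n
    a, b, d = covariance_2d(states)
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    whitened = []
    for x, y in states:
        z1 = (x - mx) / l11
        z2 = ((y - my) - l21 * z1) / l22
        whitened.append((z1, z2))
    return whitened


# Hypothetical correlated, off-center trajectory used only as example input.
T = 100
raw = [(3.0 * math.cos(2.0 * math.pi * t / T) + 1.0,
        0.5 * math.sin(2.0 * math.pi * t / T)
        + 0.8 * math.cos(2.0 * math.pi * t / T) - 2.0)
       for t in range(T)]

if __name__ == "__main__":
    print("before:", covariance_2d(raw))
    print("after: ", covariance_2d(whiten_2d(raw)))
```

After the transform every direction of the state space has unit variance, which is one way to see why differences in excitation across trajectories no longer translate into differences in gradient scale.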

Theorems & Definitions (13)

  • theorem 1: High PE Yields Low Gradient Variance
  • theorem 2: High Variance Yields High Marginal Value
  • theorem 3: Stabilization Reverses the PE-Value Correlation
  • corollary 1: Stabilization Flip
  • lemma 1: Variance Bound via State Covariance
  • proof
  • lemma 2: PE Controls State Covariance
  • proof
  • proof: Proof of Theorem 1
  • lemma 3: Escape Probability Increases with Noise
  • ...and 3 more