Reward-Relevance-Filtered Linear Offline Reinforcement Learning

Angela Zhou

TL;DR

This work targets offline reinforcement learning with linear function approximation under a causal sparsity regime where a sparse reward-relevant component governs decisions. The authors propose reward-filtered FQI, which first identifies the reward-relevant support via thresholded LASSO and then performs least-squares q-function estimation restricted to this sparse support. They establish finite-sample predictive guarantees and approximate Bellman completeness for the sparse function class, showing that policy quality depends on the sparse component size $|\rho|$ rather than the full state dimension. Empirical results on a synthetic linear-MDP setting demonstrate improved q-estimation accuracy and better control of false positives compared to naive thresholded-LASSO FQI, highlighting the practical impact of leveraging reward-relevance structure for offline RL.

Abstract

This paper studies offline reinforcement learning with linear function approximation in a setting with decision-theoretic, but not estimation, sparsity. The structural restrictions of the data-generating process presume that the transitions factor into a sparse component that affects the reward and could affect additional exogenous dynamics that do not affect the reward. Although the minimally sufficient adjustment set for estimation of full-state transition properties depends on the whole state, the optimal policy, and therefore the state-action value function, depends only on the sparse component: we call this causal/decision-theoretic sparsity. We develop a method for reward-filtering the estimation of the state-action value function to the sparse component by a modification of thresholded lasso in least-squares policy evaluation. We provide theoretical guarantees for our reward-filtered linear fitted-Q-iteration, with sample complexity depending only on the size of the sparse component.
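
The two-stage idea described above (select a support from the reward, then fit the q-function by least squares on that support only) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: it assumes discrete actions, raw state coordinates as features, one linear q-function per action, and a single batch of transitions reused at every backward step. The names `lasso_ista` and `reward_filtered_fqi`, the penalty `lam`, the threshold `tau`, and the synthetic data are all hypothetical.

```python
import numpy as np


def lasso_ista(X, y, lam, n_iter=300):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the smooth part
    b = np.zeros(d)
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y)) / n
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return b


def reward_filtered_fqi(S, A, R, S_next, n_actions, horizon, lam=0.05, tau=0.1):
    """Assumed simplification: reward-based support selection, then restricted FQI."""
    # Stage 1: thresholded LASSO of the reward on the state, union over actions.
    selected = set()
    for a in range(n_actions):
        mask = A == a
        coef = lasso_ista(S[mask], R[mask], lam)
        selected |= set(np.flatnonzero(np.abs(coef) > tau).tolist())
    support = np.array(sorted(selected), dtype=int)

    # Stage 2: backward fitted-Q iteration, ordinary least squares on the support.
    theta = np.zeros((n_actions, support.size))  # terminal q-function is zero
    for _ in range(horizon):
        v_next = (S_next[:, support] @ theta.T).max(axis=1)  # greedy next-step value
        y = R + v_next                                       # Bellman target
        new_theta = np.zeros_like(theta)
        for a in range(n_actions):
            mask = A == a
            new_theta[a] = np.linalg.lstsq(S[mask][:, support], y[mask], rcond=None)[0]
        theta = new_theta
    return support, theta


# Synthetic check: only coordinates 0 and 1 affect the reward; the remaining
# coordinates are exogenous but are still driven by the relevant coordinate 0.
rng = np.random.default_rng(0)
n, d, n_actions = 2000, 10, 2
S = rng.normal(size=(n, d))
A = rng.integers(n_actions, size=n)
sign = np.where(A == 0, 1.0, -1.0)
R = sign * S[:, 0] + 0.5 * S[:, 1] + 0.1 * rng.normal(size=n)
S_next = 0.5 * S + 0.3 * rng.normal(size=(n, d))
S_next[:, 2:] += 0.3 * S[:, [0]]
support, theta = reward_filtered_fqi(S, A, R, S_next, n_actions, horizon=3)
```

In this toy setup the selected support should contain only the two reward-relevant coordinates, so the least-squares step at each iteration fits two coefficients per action instead of ten. Selecting the support from the reward rather than from the full Bellman target is the point of the filtering step: the exogenous coordinates help predict the next state but carry no information about the optimal action.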

Paper Structure (27 sections, 9 theorems, 61 equations, 3 figures, 2 algorithms)

Key Result

Proposition 1

When $s^\rho_t = \tilde{s}^\rho_t$, $\pi^*_t(s_t) = \tilde{\pi}^*_t(\tilde{s}_t)$.

Figures (3)

  • Figure 1: Reward-relevant/irrelevant factored dynamics. The dotted line from $a_t$ to $s_{t+1}^{\rho_c}$ indicates the presence or absence is permitted in the model.
  • Figure 2: "Exogenous/endogenous MDP" of Dietterich et al. (2018).

Theorems & Definitions (18)

  • Definition 1: Linear Bellman Completeness
  • Proposition 1: Sparse optimal policies
  • Proposition 2: Reward-sparse function classes are Bellman-complete.
  • Definition 2: Problem-dependent constants.
  • Theorem 1: Prediction error bound for reward-thresholded LASSO
  • Proposition 3: Bound on Bellman completeness violation under approximate recovery
  • Theorem 2
  • Proof: Proof of Proposition 1 (Sparse optimal policies)
  • Proof: Proof of Proposition 2 (Reward-sparse function classes are Bellman-complete)
  • Theorem 3: Prediction error bounds of $\mathcal{I}$-restricted ordinary least squares of the Bellman residual
  • ...and 8 more