Sequential Knockoffs for Variable Selection in Reinforcement Learning

Tao Ma; Jin Zhu; Hengrui Cai; Zhengling Qi; Yunxiao Chen; Chengchun Shi; Eric B. Laber

TL;DR

This work tackles the problem of learning in RL with high-dimensional, potentially non-Markovian state representations by introducing the minimal sufficient state, a parsimonious substate that preserves the Markov property and the reward structure. It then proposes SEEK, a sequential knockoffs framework that identifies the minimal sufficient state in complex nonlinear MDPs by handling temporal dependence via data splitting and action-wise knockoffs, while providing FDR control and power guarantees. Theoretical results show SEEK consistently recovers the minimal sufficient state under beta-mixing and related conditions, and simulations plus real-data analyses (MIMIC-III and OhioT1DM) demonstrate substantial improvements in variable selection accuracy and downstream policy performance. Practically, SEEK enables reliable state reduction and interpretable policy learning in offline RL, with robust performance across diverse domains and strong theoretical guarantees for error control and power.

Abstract

In real-world applications of reinforcement learning, it is often challenging to obtain a state representation that is parsimonious and satisfies the Markov property without prior knowledge. Consequently, it is common practice to construct a state larger than necessary, e.g., by concatenating measurements over contiguous time points. However, needlessly increasing the dimension of the state may slow learning and obfuscate the learned policy. We introduce the notion of a minimal sufficient state in a Markov decision process (MDP) as the subvector of the original state under which the process remains an MDP and shares the same reward function as the original process. We propose a novel SEquEntial Knockoffs (SEEK) algorithm that estimates the minimal sufficient state in a system with high-dimensional complex nonlinear dynamics. In large samples, the proposed method achieves selection consistency. As the method is agnostic to the reinforcement learning algorithm being applied, it benefits downstream tasks such as policy learning. Empirical experiments verify theoretical results and show the proposed approach outperforms several competing methods regarding variable selection accuracy and regret.
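The selection step that knockoff-based methods rely on can be illustrated with the generic knockoff+ threshold from the knockoff filter literature. This is a minimal sketch, not the SEEK algorithm itself: the function name `knockoff_plus_select`, the input `W` (one feature statistic per state variable, where large positive values favor the original variable over its knockoff copy), and the target level `q` are assumptions made for illustration. How SEEK constructs its statistics under temporal dependence is not shown here.

```python
import numpy as np


def knockoff_plus_select(W, q=0.1):
    """Select variables with the generic knockoff+ threshold (illustrative only).

    W : hypothetical feature statistics, one per candidate state variable.
    q : target false discovery rate.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W, ascending.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Large negative statistics estimate how many nulls sit above +t.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            # Smallest threshold meeting the target: keep variables with W >= t.
            return np.flatnonzero(W >= t)
    # No threshold meets the target level, so nothing is selected.
    return np.array([], dtype=int)


if __name__ == "__main__":
    W = [5, 4, 3, 2.5, -1, 0.5, 2, -0.2, 6, 3.5]
    print(knockoff_plus_select(W, q=0.2))
```

The "+1" in the numerator makes the estimate conservative, which is why a small number of variables with a strict `q` can yield an empty selection.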

Paper Structure

This paper contains 58 sections, 21 theorems, 101 equations, 9 figures, 13 tables, 4 algorithms.

Introduction
Related Work
Organization of the Paper
Preliminaries: Contextual Bandits, Variable Selection with Knockoffs and Challenges of Adapting to RL
Minimal Sufficient State and Model-based Selections
SEEK: Sequential Knockoffs for Variable Selection
Theoretical Results
FDR and Type-I Error
Power Analysis
Simulation Experiments
Experiment Design, Benchmarks, and Evaluation Metrics
Results
Analysis of the MIMIC-III Dataset
Additional Numerical Details
Balancing Type-I and Type-II errors
...and 43 more sections

Key Result

Proposition 1

If $\mathbf{S}_{G}$ is a sufficient state, then there exists an optimal policy depending only on $\mathbf{S}_G$ that maximizes the $\gamma$-discounted expected cumulative reward.
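One way to write the proposition out is sketched below. The notation is assumed for illustration and is not taken from the paper: $\pi$ denotes a policy, $R_t$ the reward at time $t$, $\mathbf{S}_t$ the full state at time $t$, and $\mathbf{S}_{t,G}$ its subvector indexed by $G$.

```latex
\pi^{*} \in \arg\max_{\pi} \; \mathbb{E}^{\pi}\!\left[ \sum_{t \ge 0} \gamma^{t} R_t \right],
\qquad
\pi^{*}(a \mid \mathbf{S}_t) = \pi^{*}(a \mid \mathbf{S}_{t,G}) \quad \text{for every action } a .
```

Read this way, discarding the coordinates outside $G$ costs nothing in attainable discounted return, which is what makes state reduction safe for downstream policy learning.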

Figures (9)

Figure 1: Receiver operating characteristic (ROC) curves for different methods in the toy example.
Figure 2: Two examples of dependence graph among reward and true state variables.
Figure A3: The FDR, TPR, and VD of SEEK and SEEK-Alpha. We use the CP environment and manually inject 96 white-noise variables into the state to create a 100-dimensional state system.
Figure A4: Performance of methods in the CartPole-v0 environment across 100 simulation runs. Each row represents an evaluation criterion. Columns 1 and 3 show results for $N=100$, while columns 2 and 4 show results for $N=200$. Columns 1-2 correspond to settings with independent null states, and columns 3-4 correspond to settings with AR(1)-structured null states. The cumulative rewards of policies learned using all states are summarized in Table \ref{tab:cartpole-drl-value}.
Figure A5: FDR of the proposed SEEK methods under the CP environment with $N=100$, $p=100$, $\alpha = 0.5$, and AR(1) noise. All results are aggregated over 20 runs.
...and 4 more figures

Theorems & Definitions (39)

Definition 1: Sufficient State
Proposition 1
Proposition 2
Definition 2: Minimal Sufficient State
Proposition 3
Proposition 4
Remark 1
Theorem 1: FDR
Theorem 2: Type-I error
Theorem 3: TPR
...and 29 more
