Sequential Knockoffs for Variable Selection in Reinforcement Learning
Tao Ma, Jin Zhu, Hengrui Cai, Zhengling Qi, Yunxiao Chen, Chengchun Shi, Eric B. Laber
TL;DR
This work addresses reinforcement learning with high-dimensional, potentially non-Markovian state representations by introducing the minimal sufficient state: a parsimonious subvector of the state that preserves the Markov property and the reward function. It then proposes SEEK, a sequential knockoffs framework that identifies the minimal sufficient state in complex nonlinear MDPs, handling temporal dependence via data splitting and action-wise knockoffs while providing FDR control and power guarantees. Theoretical results show that SEEK consistently recovers the minimal sufficient state under beta-mixing and related conditions, and simulations plus real-data analyses (MIMIC-III and OhioT1DM) demonstrate substantial improvements in variable selection accuracy and downstream policy performance. In practice, SEEK enables reliable state reduction and interpretable policy learning in offline RL.
Abstract
In real-world applications of reinforcement learning, it is often challenging to obtain a state representation that is parsimonious and satisfies the Markov property without prior knowledge. Consequently, it is common practice to construct a state larger than necessary, e.g., by concatenating measurements over contiguous time points. However, needlessly increasing the dimension of the state may slow learning and obfuscate the learned policy. We introduce the notion of a minimal sufficient state in a Markov decision process (MDP) as the subvector of the original state under which the process remains an MDP and shares the same reward function as the original process. We propose a novel SEquEntial Knockoffs (SEEK) algorithm that estimates the minimal sufficient state in a system with high-dimensional complex nonlinear dynamics. In large samples, the proposed method achieves selection consistency. Because the method is agnostic to the reinforcement learning algorithm being applied, it benefits downstream tasks such as policy learning. Empirical experiments verify the theoretical results and show that the proposed approach outperforms several competing methods in terms of variable selection accuracy and regret.
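To make the selection step of a knockoff procedure concrete, here is a minimal sketch of the standard knockoff+ thresholding rule from the general knockoff filter literature. It is not taken from this paper, and the text above does not say which rule SEEK applies. It assumes the per-variable statistics W_j have already been computed, with large positive values suggesting that a state variable matters more than its knockoff copy. The function names `knockoff_plus_threshold` and `knockoff_select` are hypothetical.

```python
import math


def knockoff_plus_threshold(w, q):
    """Return the smallest t with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    Returns math.inf when no candidate threshold meets the target level q.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of the statistics.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # variables that would be selected
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return math.inf


def knockoff_select(w, q):
    """Indices of variables whose statistic reaches the knockoff+ threshold."""
    tau = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= tau]


# Illustrative statistics only (made up for this example, not from the paper).
w_example = [5.0, 4.0, 3.0, -0.5, 0.2, 2.5, -0.1, 6.0]
selected = knockoff_select(w_example, q=0.25)
```

With these made-up statistics and q = 0.25, the threshold is 2.5 and the variables at indices 0, 1, 2, 5 and 7 are kept. Under a stricter target such as q = 0.1, no threshold qualifies and nothing is selected.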
