Replicable Reinforcement Learning with Linear Function Approximation
Eric Eaton, Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
TL;DR
This paper addresses exact replicability in reinforcement learning with linear function approximation in linear MDPs. It introduces two replicable tools, replicable ridge regression and replicable uncentered covariance estimation, and builds replicable RL algorithms on top of them for both the generative-model setting (R-LSVI with core sets) and the episodic setting (R-LSVI-UCB). The authors give finite-sample replicability guarantees with explicit sample complexities, supported by experiments on real datasets and on neural Q-value quantization that illustrate the practical stability benefits. By making runs on fresh samples return identical outputs, the work is a step toward more auditable RL systems, and it points to feature learning and scaling to non-linear regimes as future directions.
Abstract
Replication of experimental results has been a challenge faced by many scientific disciplines, including the field of machine learning. Recent work on the theory of machine learning has formalized replicability as the demand that an algorithm produce identical outcomes when executed twice on different samples from the same distribution. Provably replicable algorithms are especially interesting for reinforcement learning (RL), where algorithms are known to be unstable in practice. While replicable algorithms exist for tabular RL settings, extending these guarantees to more practical function approximation settings has remained an open problem. In this work, we make progress by developing replicable methods for linear function approximation in RL. We first introduce two efficient algorithms for replicable random design regression and uncentered covariance estimation, each of independent interest. We then leverage these tools to provide the first provably efficient replicable RL algorithms for linear Markov decision processes in both the generative model and episodic settings. Finally, we evaluate our algorithms experimentally and show how they can inspire more consistent neural policies.
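To make the replicability requirement concrete, the sketch below shows one generic way to obtain identical outputs from different samples: fit ridge regression, then snap the coefficients to a grid whose random offset comes from a seed shared across runs. This is an illustration of the randomized-rounding idea only, not the paper's replicable regression algorithm. All names (ridge, round_to_shared_grid, replicable_ridge_sketch, draw_sample) and all parameter values are assumptions made for this example.

```python
import numpy as np

def ridge(X, y, lam=1.0):
    # Standard ridge regression: solve (X^T X + lam * I) w = X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def round_to_shared_grid(w, width, shared_seed):
    # Snap each coordinate to a grid of spacing `width`. The grid offset
    # depends only on `shared_seed`, so every run uses the same grid.
    rng = np.random.default_rng(shared_seed)
    offset = rng.uniform(0.0, width, size=w.shape)
    return np.round((w - offset) / width) * width + offset

def replicable_ridge_sketch(X, y, lam, width, shared_seed):
    # Hypothetical helper: estimate first, then round with shared randomness.
    return round_to_shared_grid(ridge(X, y, lam), width, shared_seed)

def draw_sample(n, w_true, data_seed):
    # Fresh i.i.d. data from one fixed distribution (assumed Gaussian design).
    rng = np.random.default_rng(data_seed)
    X = rng.normal(size=(n, w_true.size))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

w_true = np.array([0.5, -1.0, 2.0])
X1, y1 = draw_sample(5000, w_true, data_seed=1)   # first run's sample
X2, y2 = draw_sample(5000, w_true, data_seed=2)   # second run's sample
w1 = replicable_ridge_sketch(X1, y1, lam=1.0, width=0.1, shared_seed=42)
w2 = replicable_ridge_sketch(X2, y2, lam=1.0, width=0.1, shared_seed=42)
print("identical outputs:", np.array_equal(w1, w2))
```

The two runs agree only with high probability: they differ whenever a grid boundary falls between the two unrounded estimates, which becomes less likely as the sample size grows relative to the grid width. A coarser grid improves agreement at the cost of accuracy, which is the kind of trade-off that replicability guarantees quantify.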
