Efficiently Solving Discounted MDPs with Predictions on Transition Matrices

Lixing Lyu; Jiashuo Jiang; Wang Chi Cheung

TL;DR

This work studies infinite-horizon discounted MDPs under a generative model, where a prediction $\hat{P}$ of the latent transition matrix is supplied in a preparation phase. It proves an impossibility result: without knowledge of the prediction error Dist, predictions cannot universally beat the known lower bound. It then introduces Optimistic Predict Mirror Descent (OPMD), a parameter-free primal-dual algorithm that blends prediction-based gradients with stochastic estimates. The analysis yields distance-aware sample complexity bounds that uniformly improve on prior primal-dual results when the prediction is informative, while remaining robust to poor predictions. Empirical results on a simple MDP illustrate accelerated learning when $\hat{P}$ is accurate and robustness when it is not, highlighting practical benefits for prediction-informed RL in finite-state settings.

Abstract

We study infinite-horizon Discounted Markov Decision Processes (DMDPs) under a generative model. Motivated by the Algorithm with Advice framework (Mitzenmacher and Vassilvitskii, 2022), we propose a novel framework to investigate how a prediction on the transition matrix can enhance sample efficiency in solving DMDPs and improve sample complexity bounds. We focus on DMDPs with $N$ state-action pairs and discount factor $\gamma$. First, we provide an impossibility result showing that, without prior knowledge of the prediction accuracy, no sampling policy can compute an $\epsilon$-optimal policy with a sample complexity bound better than $\tilde{O}((1-\gamma)^{-3} N \epsilon^{-2})$, which matches the state-of-the-art minimax sample complexity bound with no prediction. Complementing this, we propose an algorithm based on minimax optimization techniques that leverages the prediction on the transition matrix. Our algorithm achieves a sample complexity bound that depends on the prediction error and is uniformly better than $\tilde{O}((1-\gamma)^{-4} N \epsilon^{-2})$, the previous best result derived from convex optimization methods. These theoretical findings are further supported by our numerical experiments.
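To make the gap between the two rates quoted above concrete, the following sketch evaluates both expressions for a few discount factors. It is only an illustration under stated assumptions: constants and logarithmic factors hidden by the $\tilde{O}$ notation are dropped, the function names are hypothetical and do not come from the paper, and the values of $N$, $\gamma$ and $\epsilon$ are arbitrary.

```python
def minimax_order(n_pairs: int, gamma: float, eps: float) -> float:
    """Order N * (1 - gamma)^-3 * eps^-2 (constants and log factors dropped)."""
    return n_pairs / ((1.0 - gamma) ** 3 * eps ** 2)


def prior_primal_dual_order(n_pairs: int, gamma: float, eps: float) -> float:
    """Order N * (1 - gamma)^-4 * eps^-2 (constants and log factors dropped)."""
    return n_pairs / ((1.0 - gamma) ** 4 * eps ** 2)


# Illustrative values only; none of these numbers come from the paper.
n_pairs, eps = 100, 0.1
for gamma in (0.9, 0.99):
    low = minimax_order(n_pairs, gamma, eps)
    high = prior_primal_dual_order(n_pairs, gamma, eps)
    # The ratio of the two orders is 1 / (1 - gamma), independent of N and eps.
    print(f"gamma={gamma}: minimax ~{low:.3g}, prior primal-dual ~{high:.3g}, "
          f"ratio ~{high / low:.3g}")
```

Under these simplifications the two rates differ by a factor of $1/(1-\gamma)$, so the gap that a prediction-dependent bound can close grows as the discount factor approaches 1.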
