Efficiently Solving Discounted MDPs with Predictions on Transition Matrices

Lixing Lyu, Jiashuo Jiang, Wang Chi Cheung

TL;DR

This work studies infinite-horizon discounted MDPs under a generative model, where a prediction $\hat{P}$ of the latent transition matrix is given in advance. It proves an impossibility result: without knowledge of the prediction error (Dist), predictions cannot universally beat the known lower bound. It then introduces Optimistic Predict Mirror Descent (OPMD), a parameter-free primal-dual algorithm that blends prediction-based gradients with stochastic estimates. The analysis yields distance-aware sample complexity bounds that uniformly improve on prior primal-dual results when the prediction is informative, while remaining robust to poor predictions. Experiments on a simple MDP show faster learning when $\hat{P}$ is accurate and robustness when it is not, which points to practical benefits for prediction-informed RL in finite-state settings.

Abstract

We study infinite-horizon Discounted Markov Decision Processes (DMDPs) under a generative model. Motivated by the Algorithm with Advice framework (Mitzenmacher and Vassilvitskii, 2022), we propose a novel framework to investigate how a prediction on the transition matrix can enhance the sample efficiency in solving DMDPs and improve sample complexity bounds. We focus on DMDPs with $N$ state-action pairs and discount factor $\gamma$. First, we provide an impossibility result: without prior knowledge of the prediction accuracy, no sampling policy can compute an $\epsilon$-optimal policy with a sample complexity bound better than $\tilde{O}((1-\gamma)^{-3} N \epsilon^{-2})$, which matches the state-of-the-art minimax sample complexity bound with no prediction. Complementing this, we propose an algorithm based on minimax optimization techniques that leverages the prediction on the transition matrix. Our algorithm achieves a sample complexity bound that depends on the prediction error, and the bound is uniformly better than $\tilde{O}((1-\gamma)^{-4} N \epsilon^{-2})$, the previous best result derived from convex optimization methods. These theoretical findings are further supported by our numerical experiments.
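
To make the setting concrete, the following is a minimal sketch of what "generative model" and "$\epsilon$-optimal policy" mean on a small random DMDP. It is not the paper's algorithm: it uses a plain plug-in approach (estimate the transition matrix from samples, then run value iteration), and every name in it (`value_iteration`, `policy_value`, `sample_empirical_P`, the problem sizes, and the per-pair sample count `n`) is a hypothetical choice made for illustration.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-10, max_iter=100_000):
    """Solve a DMDP with known transitions P (S, A, S) and rewards r (S, A)."""
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        v_new = (r + gamma * (P @ v)).max(axis=1)  # Bellman optimality update
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            break
    greedy = (r + gamma * (P @ v)).argmax(axis=1)  # greedy policy w.r.t. v
    return v, greedy

def policy_value(P, r, gamma, policy):
    """Exact value of a deterministic policy: solve (I - gamma * P_pi) v = r_pi."""
    S = P.shape[0]
    idx = np.arange(S)
    return np.linalg.solve(np.eye(S) - gamma * P[idx, policy], r[idx, policy])

def sample_empirical_P(P, n, rng):
    """Generative model: draw n i.i.d. next states for every state-action pair."""
    S, A, _ = P.shape
    P_emp = np.zeros_like(P)
    for s in range(S):
        for a in range(A):
            P_emp[s, a] = rng.multinomial(n, P[s, a]) / n
    return P_emp

# Illustrative instance (assumed sizes): N = S * A = 15 state-action pairs.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # latent transition matrix
r = rng.uniform(size=(S, A))                 # rewards in [0, 1]

v_star, _ = value_iteration(P, r, gamma)
P_emp = sample_empirical_P(P, n=2000, rng=rng)   # uses N * n samples in total
_, pi_emp = value_iteration(P_emp, r, gamma)

# The learned policy is epsilon-optimal if this gap is at most epsilon.
gap = np.max(v_star - policy_value(P, r, gamma, pi_emp))
print(f"suboptimality gap of the plug-in policy: {gap:.4f}")
```

The quantity `gap` is the worst-case value loss over states when the policy learned from samples is evaluated on the true transitions; a sample complexity bound says how many generative-model queries suffice to push this gap below $\epsilon$ with high probability.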

Paper Structure

This paper contains 33 sections, 10 theorems, 67 equations, 3 figures, 1 algorithm.

Key Result

Theorem 1

Suppose $N \ge 6$, $\gamma \in \left[1/3, 1\right)$, $\epsilon \in \left(0, (1-\gamma)^{-1}/40\right]$, $\delta \in (0, 0.24]$. Consider a fixed but arbitrary algorithm ALG. If ALG is $(\epsilon,\delta)$-smart on a specific DMDP instance $\mathcal{M}_0$ with $\mathcal{S}$, $\mathcal{A}$ and $N$ state-action pairs in total, then there exists another DMDP instance $\mathcal{M}'$ with the same $\mathcal{S}, \mathcal{A}, \gamma$ as ...

Figures (3)

  • Figure 1: Duality Gap
  • Figure 2: Value Function
  • Figure 3: MDP Example

Theorems & Definitions (12)

  • Definition 1
  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Definition 2
  • Lemma 3
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Lemma 4
  • ...and 2 more