Truncated Variance Reduced Value Iteration

Yujia Jin; Ishani Karmarkar; Aaron Sidford; Jiayi Wang

TL;DR

The paper studies computing $\varepsilon$-optimal policies in discounted MDPs (DMDPs) with $\mathcal{A}_{\text{tot}}$ state-action pairs and introduces truncated variance-reduced value iteration. The core idea combines recursive variance reduction with a truncation mechanism that limits how far the iterates move per iteration, which yields tighter concentration via Freedman-type bounds and therefore fewer samples. In the offline setting, the method runs in $\tilde{O}(\mathrm{nnz}(\mathbf{P}) + \mathcal{A}_{\text{tot}}(1-\gamma)^{-2})$ time with $\tilde{O}(\mathcal{A}_{\text{tot}})$ space; in the sampling setting, it uses $\tilde{O}(\mathcal{A}_{\text{tot}}[(1-\gamma)^{-3}\varepsilon^{-2} + (1-\gamma)^{-2}])$ samples and time and also computes $\varepsilon$-optimal values. These results improve over prior variance-reduced methods by reducing the additive $(1-\gamma)^{-3}$ term to $(1-\gamma)^{-2}$, and they narrow the gap between model-free and model-based approaches while keeping memory low. The techniques include monotone underestimates, telescoping variance estimates, and problem-dependent refinements, with specialized variants for deterministic, small-range, and highly mixing MDPs. Overall, the work improves the efficiency of computing near-optimal policies for large-scale DMDPs given either generative-model access or known transitions.
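To make the interplay of the two ideas concrete, here is a minimal Python sketch of one way to combine a recursively updated (telescoping) sampling estimate with a truncation step. It is an illustration under stated assumptions, not the authors' algorithm: the function names (`sample_mean`, `truncated_vr_epoch`, `solve`), the sample sizes `n_ref` and `n_step`, and the truncation rule (clipping each value to within `tau` of the epoch's reference value) are all hypothetical, and the monotone underestimates mentioned above are omitted.

```python
import random


def sample_mean(P, u, n, rng):
    """Monte Carlo estimate of (P u)[s][a] using n generative-model samples per pair."""
    S = len(P)
    est = []
    for s in range(S):
        row = []
        for a in range(len(P[s])):
            # Draw n next states from the transition distribution P[s][a].
            nxt = rng.choices(range(S), weights=P[s][a], k=n)
            row.append(sum(u[j] for j in nxt) / n)
        est.append(row)
    return est


def truncated_vr_epoch(P, r, gamma, v_ref, tau, n_ref, n_step, T, rng):
    """One epoch: anchor at v_ref, then T truncated variance-reduced steps."""
    S = len(P)
    x = sample_mean(P, v_ref, n_ref, rng)      # anchor estimate, x ~ P v_ref
    g = [[0.0] * len(P[s]) for s in range(S)]  # running estimate of P (v_t - v_ref)
    v_prev, v = list(v_ref), list(v_ref)
    for _ in range(T):
        # Recursive variance reduction: only the small change v_t - v_{t-1}
        # is sampled, and the estimates telescope into g.
        diff = [v[s] - v_prev[s] for s in range(S)]
        d = sample_mean(P, diff, n_step, rng)
        v_prev = v
        new_v = []
        for s in range(S):
            for a in range(len(P[s])):
                g[s][a] += d[s][a]
            q_best = max(r[s][a] + gamma * (x[s][a] + g[s][a])
                         for a in range(len(P[s])))
            # Truncation (assumed rule): stay within tau of the reference value.
            new_v.append(min(max(q_best, v_ref[s] - tau), v_ref[s] + tau))
        v = new_v
    return v


def solve(P, r, gamma, epochs, tau, n_ref, n_step, T, seed=0):
    """Run several epochs starting from the all-zeros value vector."""
    rng = random.Random(seed)
    v = [0.0] * len(P)
    for _ in range(epochs):
        v = truncated_vr_epoch(P, r, gamma, v, tau, n_ref, n_step, T, rng)
    return v
```

Here `P[s][a]` is a list of next-state probabilities and `r[s][a]` a reward. Because each step samples only the difference between consecutive iterates, and truncation keeps that difference small, the per-step estimates have small range, which is the property the summary attributes to the tighter concentration argument.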

Abstract

We provide faster randomized algorithms for computing an $\varepsilon$-optimal policy in a discounted Markov decision process with $A_{\text{tot}}$ state-action pairs, bounded rewards, and discount factor $\gamma$. We provide an $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\varepsilon^{-2} + (1 - \gamma)^{-2}])$-time algorithm in the sampling setting, where the probability transition matrix is unknown but accessible through a generative model which can be queried in $\tilde{O}(1)$-time, and an $\tilde{O}(s + A_{\text{tot}}(1-\gamma)^{-2})$-time algorithm in the offline setting where the probability transition matrix is known and $s$-sparse. These results improve upon the prior state-of-the-art which either ran in $\tilde{O}(A_{\text{tot}}[(1 - \gamma)^{-3}\varepsilon^{-2} + (1 - \gamma)^{-3}])$ time [Sidford, Wang, Wu, Ye 2018] in the sampling setting, $\tilde{O}(s + A_{\text{tot}} (1-\gamma)^{-3})$ time [Sidford, Wang, Wu, Yang, Ye 2018] in the offline setting, or time at least quadratic in the number of states using interior point methods for linear programming. We achieve our results by building upon prior stochastic variance-reduced value iteration methods [Sidford, Wang, Wu, Yang, Ye 2018]. We provide a variant that carefully truncates the progress of its iterates to improve the variance of new variance-reduced sampling procedures that we introduce to implement the steps. Our method is essentially model-free and can be implemented in $\tilde{O}(A_{\text{tot}})$-space when given generative model access. Consequently, our results take a step in closing the sample-complexity gap between model-free and model-based methods.
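As a short worked comparison of the two sampling-setting bounds quoted above, the following derivation (plain arithmetic on the stated expressions, ignoring logarithmic factors) shows for which accuracies $\varepsilon$ the additive term is dominated by the leading $(1-\gamma)^{-3}\varepsilon^{-2}$ term.

```latex
\begin{align*}
\text{New bound:}\quad
  (1-\gamma)^{-2} \le (1-\gamma)^{-3}\varepsilon^{-2}
  &\iff \varepsilon^{2} \le (1-\gamma)^{-1}
  \iff \varepsilon \le (1-\gamma)^{-1/2}, \\
\text{Prior bound:}\quad
  (1-\gamma)^{-3} \le (1-\gamma)^{-3}\varepsilon^{-2}
  &\iff \varepsilon \le 1 .
\end{align*}
```

For example, with $\gamma = 0.99$ the leading term dominates the new bound for every $\varepsilon \le 10$, whereas it dominates the prior bound only for $\varepsilon \le 1$. Assuming rewards lie in $[0, 1]$ (the abstract says only "bounded"), values range up to $(1-\gamma)^{-1} = 100$, so accuracies between $1$ and $10$ are meaningful targets in that case.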

Paper Structure

This paper contains 21 sections, 16 theorems, 63 equations, and 2 tables.

Introduction
Our results
Exact DMDP Algorithms.
Comparison with IPM Approaches.
Overview of approach
Value iteration.
Stochastic value iteration and variance reduction.
Recursive variance reduction.
Truncated-value iteration.
Our method.
Notation and paper outline
General notation.
DMDP.
Outline.
Offline algorithm
...and 6 more sections

Key Result

Theorem 1.1

In the sample setting, there is an algorithm that uses $\tilde{O}(\mathcal{A}_{\mathrm{tot}}[(1-\gamma)^{-3} \varepsilon^{-2} + (1 - \gamma)^{-2}])$ samples and time and $O(\mathcal{A}_{\mathrm{tot}})$ space, and computes an $\varepsilon$-optimal policy and $\varepsilon$-optimal values with probabil

Theorems & Definitions (26)

Theorem 1.1
Theorem 1.2
Lemma 1.3
proof
Lemma 2.0
proof
Theorem 2.1: Freedman's Inequality, restated from tropp2011freedman
Lemma 2.2
proof
Corollary 2.2
...and 16 more
