Adaptive Discretization in Online Reinforcement Learning

Sean R. Sinclair, Siddhartha Banerjee, Christina Lee Yu

TL;DR

This work develops an algorithmic framework for nonparametric RL with data-driven adaptive discretization that has provably better sample, storage, and computational complexity than uniform discretization or kernel regression methods.

Abstract

Discretization-based approaches to solving online reinforcement learning problems have been studied extensively in practice, on applications ranging from resource allocation to cache management. Two major questions in designing discretization-based algorithms are how to create the discretization and when to refine it. While several experimental results investigate heuristic solutions to these questions, there has been little theoretical treatment. In this paper we provide a unified theoretical analysis of tree-based hierarchical partitioning methods for online reinforcement learning, covering both model-free and model-based algorithms. We show that our algorithms exploit the inherent structure of the problem: their guarantees scale with the 'zooming dimension', an instance-dependent quantity measuring the benignness of the optimal $Q_h^\star$ function, instead of the ambient dimension. Many applications in computing systems and operations research require algorithms that compete on three facets: low sample complexity, mild storage requirements, and low computational burden. Our algorithms are easily adapted to operating constraints, and our theory provides explicit bounds for each of the three facets. This motivates the use of our approach in practical applications, as it automatically adapts to underlying problem structure even when very little is known a priori about the system.

Paper Structure

This paper contains 50 sections, 40 theorems, 199 equations, 7 figures, and 3 tables.

Key Result

Lemma 2.4

Suppose that the model-based Lipschitz assumption (assumption:Lipschitz_mb) holds. Then the model-free Lipschitz assumption (assumption:Lipschitz_mf) holds with $L_V = \sum_{h=0}^{H} L_r L_T^{h}$.
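The constant in Lemma 2.4 is a finite geometric sum, so it can be evaluated directly or through the usual closed form. The snippet below is only a small numerical sketch of that arithmetic; the function names are hypothetical and do not come from the paper.

```python
def value_lipschitz_constant(L_r, L_T, H):
    """Evaluate L_V = sum_{h=0}^{H} L_r * L_T**h term by term."""
    return sum(L_r * L_T ** h for h in range(H + 1))


def value_lipschitz_closed_form(L_r, L_T, H):
    """Same quantity via the geometric-series identity."""
    if L_T == 1:
        # Every term equals L_r, and there are H + 1 terms.
        return L_r * (H + 1)
    return L_r * (L_T ** (H + 1) - 1) / (L_T - 1)
```

When $L_T > 1$ the sum grows geometrically in the horizon $H$, while for $L_T = 1$ it grows only linearly, as $L_r (H+1)$.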

Figures (7)

  • Figure 1: Partitioning scheme for $\mathcal{S}\times\mathcal{A}=[0,1]^2$. The partition diagram panel (fig:partition_diagram) illustrates our scheme. Partition $\mathcal{P}^{k-1}_h$ is depicted with its corresponding tree (active balls in green, inactive parents in red). The algorithm plays ball $B_{h-1}^k$ in step $h-1$, leading to new state $X_h^k$. Since $\ell(B_{h-1}^k)=2$, in AdaMB we store transition estimates $\overline{\mathbf{T}}_{h-1}^{k}(\cdot \mid B_{h-1}^k)$ for all subsets of $\mathcal{S}$ of diameter $2^{-2}$, denoted $\mathcal{S}_{2}$ (depicted via dotted lines). The set of relevant balls $\textsc{Relevant}_h^k(X_h^k) = \{B_4,B_{21},B_{23}\}$ is highlighted in blue. Here $\mathcal{S}(\mathcal{P}_h^{k-1})$ would be $\{[0,\frac{1}{2}],[\frac{1}{2}, \frac{3}{4}],[\frac{3}{4}, 1]\}$. The second panel (fig:partition_practice) shows the partition $\mathcal{P}_{2}^K$ from one of our synthetic experiments. The colors denote the true $Q_2^\star$ values, with green corresponding to higher values. Note that the partition is more refined in areas with higher $Q_2^\star$.
  • Figure 2: Comparison of the discretization observed between AdaMB and AdaQL for the oil environment with $d = 1$ at step $h = 2$. The underlying colors correspond to the true $Q_2^\star$ function, where green corresponds to a higher value. In all of the results we see that the adaptive discretization algorithms maintain a level of discretization proportional to the underlying $Q_h^\star$ value. This leads to sample complexity gains, as the algorithm quickly learns where the near-optimal state-action pairs are, and to space and time complexity improvements, as it does not maintain a fine discretization across the entire space. In the settings with $\alpha = 0$, where we have concrete zooming dimension improvements (as in panel fig:1a), the zooming dimension guarantees justify the improvements of the adaptive algorithms.
  • Figure 3: Comparison of the discretization observed between AdaMB and AdaQL for the ambulance environment with $k = 1$ at step $h = 2$. While the zooming dimension gives no improvements on the state-space dependence, empirically we see the algorithm only maintaining a discretization on states induced by the visitation distribution of the optimal policy.
  • Figure 4: Comparison of the performance (including average reward, time complexity, space complexity) between AdaMB, AdaQL, EpsMB, EpsQL, Random, and SB PPO for the oil environment with Laplace rewards. We see that the adaptive discretization algorithms outperform their uniform discretization counterparts, with AdaMB and AdaQL achieving similar levels of performance. SB PPO does not have enough episodes to learn any signal, so its performance is essentially that of a randomized algorithm. When $\alpha = 0$ (as in panel fig:3a), the zooming dimension gives improved guarantees for the sample complexity, providing potential justification for the improved performance of the adaptive algorithms.
  • Figure 5: Comparison of the performance (including average reward, time complexity, space complexity) between AdaMB, AdaQL, EpsMB, EpsQL, Random, and SB PPO for the oil environment with quadratic rewards. When $d = 2$ (as in panel fig:4e) we see that the adaptive algorithms drastically outperform all other algorithms. This can be attributed to the adaptive algorithms maintaining a smaller partition of the space, and hence requiring exponentially fewer samples for exploration.
  • ...and 2 more figures
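The tree-based partition described in the Figure 1 caption (active leaf balls, inactive parents, and the set of balls relevant to a state) can be sketched as follows. This is a minimal illustration for $\mathcal{S}\times\mathcal{A}=[0,1]^2$, not the authors' implementation: the `Ball` class, the helper names, and the splitting threshold of $4^{\ell}$ visits are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Ball:
    """Square cell of [0,1]^2 (state x action) with side length 2**-level."""
    corner: tuple                      # lower-left corner (state, action)
    level: int                         # depth in the tree
    visits: int = 0
    children: list = field(default_factory=list)

    @property
    def side(self):
        return 2.0 ** -self.level

    def is_active(self):
        # Leaves are active; a split parent stays in the tree but is inactive.
        return not self.children

    def contains_state(self, x):
        # Only the state coordinate matters for relevance.
        return self.corner[0] <= x <= self.corner[0] + self.side

    def split(self):
        # Replace this ball by its four dyadic children.
        half = self.side / 2
        self.children = [
            Ball((self.corner[0] + i * half, self.corner[1] + j * half),
                 self.level + 1)
            for i, j in product((0, 1), repeat=2)
        ]


def active_balls(root):
    """Collect all leaves of the partition tree."""
    stack, leaves = [root], []
    while stack:
        ball = stack.pop()
        if ball.is_active():
            leaves.append(ball)
        else:
            stack.extend(ball.children)
    return leaves


def relevant(root, x):
    """Active balls whose state projection contains the state x."""
    return [b for b in active_balls(root) if b.contains_state(x)]


def visit(ball, threshold=lambda level: 4 ** level):
    """Record a visit to an active ball; split it once the assumed threshold is hit."""
    ball.visits += 1
    if ball.visits >= threshold(ball.level):
        ball.split()
```

For example, starting from `root = Ball((0.0, 0.0), 0)`, a single call `visit(root)` splits the root into four level-1 balls, and `relevant(root, 0.3)` then returns the two of them whose state interval is $[0, \frac{1}{2}]$. Repeatedly visiting one region refines only that region, which is the behaviour the captions describe.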

Theorems & Definitions (77)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Lemma 2.4
  • Definition 2.5
  • Definition 2.6
  • Definition 2.7
  • Lemma 2.8
  • Lemma 2.9
  • Definition 3.1
  • ...and 67 more