Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation

Wooseong Cho; Taehyun Hwang; Joongkyu Lee; Min-hwan Oh

TL;DR

To the best of our knowledge, these are the first randomized RL algorithms for the MNL transition model that achieve statistical guarantees with constant-time computational cost per episode.

Abstract

We study reinforcement learning with multinomial logistic (MNL) function approximation where the underlying transition probability kernel of the Markov decision processes (MDPs) is parametrized by an unknown transition core with features of state and action. For the finite horizon episodic setting with inhomogeneous state transitions, we propose provably efficient algorithms with randomized exploration having frequentist regret guarantees. For our first algorithm, $\texttt{RRL-MNL}$, we adapt optimistic sampling to ensure the optimism of the estimated value function with sufficient frequency. We establish that $\texttt{RRL-MNL}$ achieves a $\tilde{O}(κ^{-1} d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T})$ frequentist regret bound with constant-time computational cost per episode. Here, $d$ is the dimension of the transition core, $H$ is the horizon length, $T$ is the total number of steps, and $κ$ is a problem-dependent constant. Despite the simplicity and practicality of $\texttt{RRL-MNL}$, its regret bound scales with $κ^{-1}$, which is potentially large in the worst case. To improve the dependence on $κ^{-1}$, we propose $\texttt{ORRL-MNL}$, which estimates the value function using the local gradient information of the MNL transition model. We show that its frequentist regret bound is $\tilde{O}(d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T} + κ^{-1} d^2 H^2)$. To the best of our knowledge, these are the first randomized RL algorithms for the MNL transition model that achieve statistical guarantees with constant-time computational cost per episode. Numerical experiments demonstrate the superior performance of the proposed algorithms.
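As a minimal sketch of what an MNL transition model looks like, the snippet below turns feature vectors and a transition core into next-state probabilities via a softmax. The function name `mnl_transition_probs`, the per-next-state feature layout, and the toy numbers are assumptions made for illustration; the paper's exact feature map and notation are not reproduced here.

```python
import math

def mnl_transition_probs(features, theta):
    """Softmax probabilities over candidate next states.

    features: one feature vector per reachable next state, for a fixed
              state-action pair (hypothetical layout, assumed here).
    theta:    the transition core; unknown to the learner in the paper,
              supplied directly in this toy example.
    """
    # Linear score of each candidate next state.
    logits = [sum(f_i * t_i for f_i, t_i in zip(f, theta)) for f in features]
    # Subtract the maximum logit for numerical stability.
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example: three reachable next states, two-dimensional features.
phi = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
theta = [0.5, -0.5]
probs = mnl_transition_probs(phi, theta)
print(probs)
```

The probabilities sum to one by construction, and a next state with a larger score under `theta` receives a larger probability.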

Paper Structure (64 sections, 40 theorems, 300 equations, 2 figures, 1 table, 3 algorithms)

Introduction
Problem Setting
Multinomial Logistic Markov Decision Processes (MNL-MDPs)
Assumptions
Discussion of assumptions
Randomized Algorithm for MNL-MDPs having constant-time computational cost
Algorithm: RRL-MNL
Online transition core estimation
Stochastically optimistic value function
Regret bound of RRL-MNL
Discussion of Theorem 1
Proof Sketch of Theorem 1
Statistically Improved Algorithm for MNL-MDPs
Algorithms: ORRL-MNL
Tight online transition core estimation
...and 49 more sections

Key Result

Theorem 1

Suppose that Assumptions assm:mnl-mdp through assm:positive kappa hold. For any $0 < \delta < \frac{\Phi(-1)}{2}$, if we set the input parameters in Algorithm 1 as $\lambda = L_{\boldsymbol{\varphi}}^2$, $\sigma_k = \widetilde{\mathcal{O}}(H \sqrt{d})$, and $M = \lceil 1 - \frac{\log H}{\log \Phi(\dots)} \rceil$ […], where $T = KH$ is the total number of steps.
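For a rough sense of how the two regret bounds quoted in the Abstract differ, the sketch below evaluates only their leading terms, ignoring the constants and logarithmic factors hidden by $\tilde{O}$. The function names and the parameter values are illustrative assumptions, not taken from the paper.

```python
import math

def rrl_mnl_leading_term(d, H, T, kappa):
    # kappa^{-1} d^{3/2} H^{3/2} sqrt(T), constants and log factors dropped.
    return (1.0 / kappa) * d**1.5 * H**1.5 * math.sqrt(T)

def orrl_mnl_leading_term(d, H, T, kappa):
    # d^{3/2} H^{3/2} sqrt(T) + kappa^{-1} d^2 H^2, constants and log factors dropped.
    return d**1.5 * H**1.5 * math.sqrt(T) + (1.0 / kappa) * d**2 * H**2

# Hypothetical problem sizes chosen only for illustration.
d, H, kappa = 4, 10, 0.01
for T in (10**4, 10**6, 10**8):
    rrl = rrl_mnl_leading_term(d, H, T, kappa)
    orrl = orrl_mnl_leading_term(d, H, T, kappa)
    print(f"T={T:>9}: RRL-MNL ~ {rrl:.3e}, ORRL-MNL ~ {orrl:.3e}")
```

In the second expression the $κ^{-1}$ factor multiplies only a term that does not grow with $T$, so as $T$ increases the ratio of the two leading terms approaches $κ$.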

Figures (2)

Figure 1: RiverSwim experiment results
Figure 2: The "RiverSwim" environment with $n$ states (Osband et al., 2013)

Theorems & Definitions (46)

Remark 1
Remark 2
Theorem 1: Regret Bound of $\texttt{RRL-MNL}$
Remark 3
Remark 4
Remark 5
Theorem 2: Regret Bound of $\texttt{ORRL-MNL}$
Corollary 1
Definition 1: Prediction error & Bellman error
Proposition 1: Derivative of MNL transition model
...and 36 more
