Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition

Long-Fei Li; Peng Zhao; Zhi-Hua Zhou

TL;DR

The paper proposes a new least-squares estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and a new self-normalized concentration tailored specifically to handle non-independent noise.

Abstract

We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs, whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of the feature mappings, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length, and $K$ is the number of episodes. Our result strictly improves the previous best-known $\widetilde{O}(dS^2 \sqrt{K} + \sqrt{HSAK})$ result of Zhao et al. (2023a), since $H \leq S$ holds by the layered MDP structure. Our advancements are primarily attributed to (i) a new least-squares estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration tailored specifically to handle non-independent noise, originally proposed in the dynamic assortment area and applied here for the first time in reinforcement learning to handle correlations between different states.
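
The comparison between the two bounds can be checked in one line. The step below is a worked restatement using only the symbols in the abstract and the stated fact $H \leq S$; it is not taken from the paper's proofs. Substituting $H \leq S$ into the first term of the new bound gives

$$
d\sqrt{H S^3 K} \;\leq\; d\sqrt{S \cdot S^3 K} \;=\; d S^2 \sqrt{K},
$$

while the second term $\sqrt{HSAK}$ is identical in both bounds. The first term is therefore never larger than the previous one, and it is strictly smaller whenever $H < S$.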

Paper Structure (31 sections, 18 theorems, 84 equations, 1 table, 1 algorithm)

INTRODUCTION
RELATED WORK
RL with adversarial losses.
RL with linear function approximation.
RL with adversarial losses and linear function approximation.
PROBLEM SETUP
Episodic adversarial MDPs.
Linear Mixture MDPs.
Occupancy measure.
THE PROPOSED ALGORITHM
Transition Estimator
Loss Estimator
Online Mirror Descent
REGRET GUARANTEE
Regret Upper Bound
...and 16 more sections

Key Result

Lemma 1

Let $\{\mathcal{F}_t\}_{t=0}^\infty$ be a filtration. Let $\{\delta_t\}_{t=1}^\infty$ be an $\mathbb{R}^N$-valued stochastic process such that $\delta_t$ is an $\mathcal{F}_t$-measurable one-hot vector. Furthermore, assume $\mathbb{E}[\delta_t | \mathcal{F}_{t-1}] = p_t$ and define $\varepsilon_t = p_t - \delta_t$. [...] Then, for any $\zeta \in (0, 1)$, with probability at least $1 - \zeta$, we have for all $t \geq 1$ [...]
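
To illustrate why the noise in this setting is non-independent across coordinates, here is a minimal simulation sketch. It assumes the noise is the centered vector $\varepsilon_t = p_t - \delta_t$ and, for simplicity, a fixed distribution `p` in place of a general $\mathcal{F}_{t-1}$-measurable $p_t$. The helper names `sample_one_hot` and `noise`, and the particular values of `p` and `T`, are hypothetical choices made for illustration and do not come from the paper.

```python
import random


def sample_one_hot(p, rng):
    """Draw a one-hot vector delta whose expectation is p."""
    idx = rng.choices(range(len(p)), weights=p, k=1)[0]
    return [1.0 if i == idx else 0.0 for i in range(len(p))]


def noise(p, delta):
    """Centered noise epsilon = p - delta (assumed sign convention)."""
    return [pi - di for pi, di in zip(p, delta)]


rng = random.Random(0)      # fixed seed for reproducibility
p = [0.5, 0.3, 0.2]         # illustrative distribution over N = 3 outcomes
T = 20000                   # number of simulated rounds

mean_eps = [0.0] * len(p)   # running average of each noise coordinate
max_abs_sum = 0.0           # largest |sum_i epsilon_i| seen in any round

for _ in range(T):
    eps = noise(p, sample_one_hot(p, rng))
    max_abs_sum = max(max_abs_sum, abs(sum(eps)))
    mean_eps = [m + e / T for m, e in zip(mean_eps, eps)]

print("empirical mean of epsilon:", mean_eps)
print("max |sum of coordinates| :", max_abs_sum)
```

Each coordinate of the noise averages out to roughly zero, yet the coordinates always sum to zero within a round because $p_t$ and $\delta_t$ both sum to one. Knowing some coordinates therefore constrains the rest, which is the kind of cross-coordinate correlation that a concentration argument treating the coordinates as independent would not account for.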

Theorems & Definitions (36)

Definition 1: Linear Mixture MDPs
Lemma 1
Lemma 2
Remark 1
Theorem 1
Remark 2
Remark 3
Lemma 3: Occupancy measure difference for linear mixture MDPs
Remark 4
Remark 5
...and 26 more
