Partially Observable Contextual Bandits with Linear Payoffs

Sihan Zeng; Sujay Bhatt; Alec Koppel; Sumitra Ganesh

TL;DR

This work proposes EMKF-Bandit, an algorithmic pipeline that integrates system identification, filtering, and classic contextual bandit algorithms into an iterative method alternating between latent parameter estimation and decision making. Numerical simulations demonstrate the benefits and practical applicability of the pipeline.

Abstract

The standard contextual bandit framework assumes fully observable and actionable contexts. In this work, we consider a new bandit setting with partially observable, correlated contexts and linear payoffs, motivated by applications in finance, where decision making is based on market information that typically displays temporal correlation and is not fully observed. We make the following contributions, marrying ideas from statistical signal processing with bandits: (i) We propose an algorithmic pipeline named EMKF-Bandit, which integrates system identification, filtering, and classic contextual bandit algorithms into an iterative method alternating between latent parameter estimation and decision making. (ii) We analyze EMKF-Bandit with Thompson sampling as the bandit algorithm and show that it incurs sublinear regret under conditions on the filtering error. (iii) We conduct numerical simulations that demonstrate the benefits and practical applicability of the proposed pipeline.
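To make the filter-then-decide idea concrete, the following is a minimal sketch of the known-transition case only. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: the linear-Gaussian model `x_{t+1} = A x_t + w_t`, `y_t = C x_t + n_t`, the per-arm payoff `theta_a^T x_t`, the names `kalman_step`, `LinearTS`, and `run`, and all numeric constants. The EM (system identification) step for an unknown transition model is omitted.

```python
import numpy as np

def kalman_step(x_hat, P, y, A, C, Q, R):
    """One predict/update step of a standard Kalman filter."""
    x_pred = A @ x_hat                      # predicted latent context
    P_pred = A @ P @ A.T + Q                # predicted covariance
    S = C @ P_pred @ C.T + R                # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(len(x_hat)) - K @ C) @ P_pred
    return x_new, P_new

class LinearTS:
    """Linear Thompson sampling, one parameter vector per arm (assumed model)."""
    def __init__(self, n_arms, d, v=1.0, seed=0):
        self.B = np.stack([np.eye(d)] * n_arms)  # per-arm precision matrices
        self.f = np.zeros((n_arms, d))           # per-arm reward-weighted sums
        self.v = v                               # posterior scale (hypothetical value)
        self.rng = np.random.default_rng(seed)

    def select(self, x_hat):
        scores = []
        for B, f in zip(self.B, self.f):
            B_inv = np.linalg.inv(B)
            cov = self.v ** 2 * B_inv
            cov = (cov + cov.T) / 2              # guard against round-off asymmetry
            theta = self.rng.multivariate_normal(B_inv @ f, cov)
            scores.append(theta @ x_hat)
        return int(np.argmax(scores))

    def update(self, arm, x_hat, reward):
        self.B[arm] += np.outer(x_hat, x_hat)
        self.f[arm] += reward * x_hat

def run(T=200, d=3, n_arms=2, seed=0):
    """Simulate the loop: filter the context, pick an arm, update the bandit."""
    rng = np.random.default_rng(seed)
    A, C = 0.9 * np.eye(d), np.eye(d)[:2]        # only two coordinates observed
    Q, R = 0.1 * np.eye(d), 0.1 * np.eye(2)
    theta = rng.normal(size=(n_arms, d))         # true (hidden) arm parameters
    x = rng.normal(size=d)
    x_hat, P = np.zeros(d), np.eye(d)
    agent = LinearTS(n_arms, d, seed=seed)
    regret = 0.0
    for _ in range(T):
        x = A @ x + rng.multivariate_normal(np.zeros(d), Q)
        y = C @ x + rng.multivariate_normal(np.zeros(2), R)
        x_hat, P = kalman_step(x_hat, P, y, A, C, Q, R)
        arm = agent.select(x_hat)                # decide using the estimate only
        means = theta @ x                        # expected payoff of every arm
        reward = means[arm] + 0.1 * rng.normal()
        agent.update(arm, x_hat, reward)
        regret += means.max() - means[arm]
    return regret
```

Calling `run()` returns the cumulative regret of one simulated trajectory. The bandit never sees the true context `x`, only the filtered estimate `x_hat`, which is the role the abstract assigns to the filtering stage.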

Paper Structure (13 sections, 10 theorems, 47 equations, 1 figure, 1 algorithm)

Introduction
EMKF-Bandit Framework
Known Transition Model
Unknown Transition Model
Regret Analysis
Experimental Results
Preliminaries
Proof of Theorem 1
Instantaneous Regret Decomposition
Estimated Deviation
Sampling Deviation
Instantaneous Regret Bound
Total Regret Bound

Key Result

Theorem 1

Let $\varepsilon_t \triangleq x_t - \hat{x}_t$. Under Assumptions A1-A3 and the parameter choice $v_t=\sigma\sqrt{9d\ln(\frac{t}{\delta})}$, the regret of Algorithm 1 satisfies with probability $1-\delta$

Figures (1)

Figure 1: Algorithm Performance Under Various Noise Levels

Theorems & Definitions (16)

Theorem 1
Lemma A.1: agrawal2013thompson, Lemma 8
Lemma A.2: auer2002using, Lemma 11
Lemma A.3: agrawal2013thompson, Lemma 6
Proposition 1
proof
Proposition 2
proof
Proposition 3
proof
...and 6 more
