Thompson Sampling for Infinite-Horizon Discounted Decision Processes

Daniel Adelman, Cagla Keceli, Alba V. Olivares-Nadal

TL;DR

We address learning in infinite-horizon discounted MDPs with an unknown parameter θ under Thompson Sampling. Standard regret can be misleading in general state-control spaces, motivating a forward-looking metric called the expected residual regret (ERR) and its probabilistic counterpart. By linking ERR to asymptotic discount optimality (ADO) and temporal-difference error, and under a relative-entropy separation condition, we prove ERR decays exponentially and that the posterior concentrates on the true parameter almost surely (complete learning), with probabilistic ERR vanishing as well. This work broadens learning guarantees for sampling-based policies to general Borel spaces under discounting, providing robust criteria for evaluating and designing learning in complex dynamic environments.

Abstract

We model a Markov decision process parametrized by an unknown parameter and study the asymptotic behavior of a sampling-based algorithm, Thompson sampling. The standard definition of regret is not always suitable for evaluating a policy, especially when the underlying chain structure is general. We show that the standard (expected) regret can grow (super-)linearly and fails to capture the notion of learning in realistic settings with non-trivial state evolution. By decomposing the standard (expected) regret, we develop a new metric, the expected residual regret, which forgets the immutable consequences of past actions and instead measures regret against the optimal reward moving forward from the current period. We show that the expected residual regret of the Thompson sampling algorithm is upper bounded by a term that converges exponentially fast to 0. We present conditions under which the posterior sampling error of Thompson sampling converges to 0 almost surely. We then introduce the probabilistic version of the expected residual regret and present conditions under which it also converges to 0 almost surely. Thus, we provide a viable concept of learning for sampling algorithms that should prove useful in broader settings than previously considered.
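To make the algorithm described in the abstract concrete, the following is a minimal sketch of Thompson sampling in a discounted MDP with an unknown parameter. It is a finite toy, not the general Borel-space setting the paper treats. Everything specific in it is an assumption made for illustration: the two candidate parameters `"A"` and `"B"`, the two-state transition tables in `P`, the rewards, the discount factor `GAMMA`, and the function names. Each period the agent samples a parameter from the posterior, acts optimally for that sample, observes the transition, and applies Bayes' rule.

```python
import random

GAMMA = 0.9  # discount factor (assumed value)

# Hypothetical model family: P[theta][s][a] = Pr(next state = 1 | state s, action a).
# Every (s, a) pair has a different law under "A" and "B", so each transition is informative.
P = {
    "A": [[0.2, 0.8], [0.3, 0.9]],
    "B": [[0.8, 0.2], [0.9, 0.3]],
}
REWARD = [0.0, 1.0]  # assumed reward for occupying state 0 or state 1


def greedy_policy(theta, sweeps=200):
    """Run value iteration on the model indexed by theta; return a greedy action per state."""
    def q(v, s, a):
        p1 = P[theta][s][a]
        return REWARD[s] + GAMMA * ((1 - p1) * v[0] + p1 * v[1])

    v = [0.0, 0.0]
    for _ in range(sweeps):
        v = [max(q(v, s, 0), q(v, s, 1)) for s in (0, 1)]
    return [max((0, 1), key=lambda a: q(v, s, a)) for s in (0, 1)]


def thompson_sampling(true_theta, steps=200, seed=0):
    """Simulate Thompson sampling; return the posterior over {"A", "B"} after `steps` periods."""
    rng = random.Random(seed)
    posterior = {"A": 0.5, "B": 0.5}  # uniform prior (assumed)
    policies = {theta: greedy_policy(theta) for theta in P}
    state = 0
    for _ in range(steps):
        # 1. Sample a parameter from the current posterior.
        names = list(posterior)
        sampled = rng.choices(names, weights=[posterior[n] for n in names])[0]
        # 2. Act optimally for the sampled parameter.
        action = policies[sampled][state]
        # 3. The environment moves according to the true parameter.
        next_state = 1 if rng.random() < P[true_theta][state][action] else 0
        # 4. Bayes update using the likelihood of the observed transition.
        for theta in posterior:
            p1 = P[theta][state][action]
            posterior[theta] *= p1 if next_state == 1 else 1 - p1
        total = sum(posterior.values())
        for theta in posterior:
            posterior[theta] /= total
        state = next_state
    return posterior
```

Calling `thompson_sampling("B")` returns the final posterior as a dictionary. In this toy the candidate models disagree at every state-action pair, which loosely mirrors the role a separation condition plays in the paper, so the posterior mass drifts toward the true parameter as transitions accumulate.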

Paper Structure

This paper contains 29 sections, 9 theorems, 127 equations, and 6 figures.

Key Result

Lemma 4.1

TS is equivalent to the $\theta$-optimal policy when the prior distribution is degenerate, i.e., $\pi_0(\theta\mid h_0)=1$.
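The intuition behind the lemma can be checked numerically with a small sketch. Under a point-mass prior, sampling always returns the same parameter and Bayes' rule cannot move mass away from it, so the sampled policy coincides with the policy that is optimal for that parameter. The parameter labels, the table `OPTIMAL_ACTION`, and the likelihood values below are hypothetical and chosen only for illustration; this is not the paper's proof.

```python
import random

# Hypothetical lookup: the action that is optimal under each candidate parameter.
OPTIMAL_ACTION = {"A": 0, "B": 1}

# Hypothetical likelihood of each observation under each parameter.
# The data deliberately favor "B", to show a point mass on "A" still does not move.
LIKELIHOOD = {"A": 0.2, "B": 0.8}


def sample(posterior, rng):
    """Draw one parameter according to the posterior weights."""
    names = list(posterior)
    return rng.choices(names, weights=[posterior[n] for n in names])[0]


def bayes_update(posterior, likelihood):
    """Multiply by the likelihood and renormalize."""
    unnormalized = {th: posterior[th] * likelihood[th] for th in posterior}
    total = sum(unnormalized.values())
    return {th: w / total for th, w in unnormalized.items()}


def ts_actions(prior, steps, seed=0):
    """Return the actions Thompson sampling takes and the final posterior."""
    rng = random.Random(seed)
    posterior = dict(prior)
    actions = []
    for _ in range(steps):
        theta = sample(posterior, rng)
        actions.append(OPTIMAL_ACTION[theta])
        posterior = bayes_update(posterior, LIKELIHOOD)
    return actions, posterior
```

With the degenerate prior `{"A": 1.0, "B": 0.0}`, every sampled parameter is `"A"`, every action is the one optimal under `"A"`, and the posterior remains a point mass regardless of the observations.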

Figures (6)

  • Figure 1: Constant reward depending only on the first control, picked at t=0.
  • Figure 2: Evolution of the stochastic process, in the case when $\theta$ is not known.
  • Figure 3: Illustration of two different sample paths giving rise to the expected state regret.
  • Figure 4: Deterministic reward depending on the control, where true parameter is B.
  • Figure 5: Stochastic rewards depending on the control, where the true parameter is B.
  • ...and 1 more figure

Theorems & Definitions (24)

  • Example 2.1: Expected regret grows linearly
  • Remark
  • Remark
  • Definition
  • Definition
  • Definition
  • Definition
  • Lemma 4.1: Degenerate prior
  • Definition
  • Example 5.1: Absorption into an unfavorable set of states
  • ...and 14 more