Partition Tree Weighting for Non-Stationary Stochastic Bandits

Joel Veness, Marcus Hutter, Andras Gyorgy, Jordi Grau-Moya

TL;DR

The paper addresses non-stationary stochastic bandits by reframing agent-environment interaction as universal source coding, distinguishing actions from observations to avoid self-delusion. It introduces ActivePTW, which combines KT-based per-arm estimators with Partition Tree Weighting to form a PTW-KTE environment and uses a Bayesian control-rule policy to sample actions. Theoretical results provide redundancy bounds and show the benefits of forced exploration, while experiments demonstrate that ActivePTW variants outperform several baselines across change-point regimes and often match Thompson Sampling in stationary settings. The work provides a principled, scalable route to universal control for non-stationary environments and suggests broader applicability of universal coding ideas to adaptive agents.
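
As a rough illustration of the KT-based per-arm estimators mentioned above, the following Python sketch keeps one Krichevsky-Trofimov (KT) estimator per arm and scores an interaction history as the product of the per-arm KT probabilities of the rewards observed under each arm. The class names `KTEstimator` and `PerArmKT` are hypothetical, and the sketch covers only a stationary per-arm model under that assumed factorisation; it does not include the Partition Tree Weighting mixture or the action-sampling policy.

```python
import math


class KTEstimator:
    """Krichevsky-Trofimov estimator for a binary sequence."""

    def __init__(self):
        self.counts = [0, 0]   # number of 0s and 1s seen so far
        self.log_prob = 0.0    # natural log of the probability assigned to the sequence

    def predict(self, bit):
        """Probability that the next symbol equals `bit`: (count + 1/2) / (n + 1)."""
        total = self.counts[0] + self.counts[1]
        return (self.counts[bit] + 0.5) / (total + 1.0)

    def update(self, bit):
        """Account for one observed symbol and update the counts."""
        self.log_prob += math.log(self.predict(bit))
        self.counts[bit] += 1


class PerArmKT:
    """Hypothetical per-arm model: one independent KT estimator for each action."""

    def __init__(self, num_arms):
        self.arms = [KTEstimator() for _ in range(num_arms)]

    def update(self, action, reward):
        # Only the estimator of the arm that was pulled sees the reward.
        self.arms[action].update(reward)

    def log_prob(self):
        # Log-probability of all rewards given the actions (assumed product form).
        return sum(arm.log_prob for arm in self.arms)


if __name__ == "__main__":
    model = PerArmKT(num_arms=2)
    for action, reward in [(0, 1), (0, 1), (1, 0)]:
        model.update(action, reward)
    print(math.exp(model.log_prob()))   # probability assigned to the observed rewards
    print(model.arms[0].predict(1))     # predictive probability of reward 1 on arm 0
```

Because actions only select which estimator is updated and are never themselves assigned a probability by the model, the sketch keeps the distinction between actions and observations that the summary emphasises.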

Abstract

This paper considers a generalisation of universal source coding for interaction data, namely data streams that have actions interleaved with observations. Our goal is to construct a coding distribution that is both universal and usable as a control policy. Allowing for action generation needs careful treatment, as naive approaches that do not distinguish between actions and observations run into the self-delusion problem in universal settings. We showcase our perspective in the context of the challenging non-stationary stochastic Bernoulli bandit problem. Our main contribution is an efficient and high-performing algorithm for this problem that generalises the Partition Tree Weighting universal source coding technique for passive prediction to the control setting.

Paper Structure

This paper contains 43 sections, 4 theorems, 52 equations, 5 figures, 4 tables, 1 algorithm.

Key Result

Proposition 2

For any stationary stochastic bandit problem $\left(A, \Theta, \mu \right)$, for all $n \in \mathbb{N}$, for all $e_{1:n} \in {\mathcal{E}}^n$ and for all $a_{1:n} \in {\mathcal{A}}^n$ we have where ${\mathcal{A}}':=\{a\in{\mathcal{A}}:\ell(e^a_{1:t})>0\}$ is the set of all actions taken at least once.

Figures (5)

  • Figure 1: Detailed Results. Each panel shows the final regret for all algorithms under different change-point rates and action space cardinalities.
  • Figure 2: Average regret across 400 repeats, with $T=10^6$, $A=5$, and each segment length sampled from a geometric distribution with success probability $0.0002$. Approximate 95% confidence intervals are indicated by shading.
  • Figure 3: A baseline comparison on stationary stochastic bandit problems. Approximate 95% confidence intervals are indicated by shading, and are computed using 400 runs of each algorithm. Note that the performance of Thompson Sampling and ActivePTW using an MEU reference policy is nearly identical and difficult to distinguish on the graph.
  • Figure 4: An adversarial example, with a single change-point at $t=5000$. Approximate 95% confidence intervals are indicated by shading, and are computed using 1600 runs of each algorithm.
  • Figure 5: A labelled binary partition tree of depth 2.
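
The change-point setting described in the Figure 2 caption can be sketched in Python as follows. Ending the current segment independently with a fixed probability after every step yields geometrically distributed segment lengths. The class name `PiecewiseBernoulliBandit`, the uniform redraw of arm means at each change-point, and the per-step regret against the best arm of the current segment are all assumptions made for illustration; the source does not specify them here.

```python
import random


class PiecewiseBernoulliBandit:
    """Hypothetical piecewise-stationary Bernoulli bandit with geometric segment lengths."""

    def __init__(self, num_arms, change_prob, seed=0):
        self.rng = random.Random(seed)
        self.num_arms = num_arms
        self.change_prob = change_prob
        self.means = self._draw_means()
        self.num_changes = 0

    def _draw_means(self):
        # Assumption: arm means are redrawn uniformly from [0, 1) at every change-point.
        return [self.rng.random() for _ in range(self.num_arms)]

    def step(self, action):
        """Pull `action`; return (reward, regret) for this step."""
        reward = 1 if self.rng.random() < self.means[action] else 0
        # Assumption: regret is the gap to the best arm of the current segment.
        regret = max(self.means) - self.means[action]
        # Ending the segment with probability change_prob after each step gives
        # segment lengths that are geometric with that success probability.
        if self.rng.random() < self.change_prob:
            self.means = self._draw_means()
            self.num_changes += 1
        return reward, regret


if __name__ == "__main__":
    # Parameters mirror the caption except for a shorter horizon to keep the run quick.
    env = PiecewiseBernoulliBandit(num_arms=5, change_prob=0.0002, seed=1)
    policy_rng = random.Random(2)
    total_regret = 0.0
    for _ in range(100_000):
        _, regret = env.step(policy_rng.randrange(5))  # uniformly random baseline policy
        total_regret += regret
    print(env.num_changes, total_regret)
```

With success probability $0.0002$ the expected segment length is 5000 steps, so a horizon of $10^6$ contains about 200 segments on average.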

Theorems & Definitions (6)

  • Definition 1: Cumulative Regret
  • Proposition 2: KTE Redundancy
  • Definition 3: veness13
  • Theorem 4
  • Lemma 5
  • Lemma 6: KT concentration