Partition Tree Weighting for Non-Stationary Stochastic Bandits
Joel Veness, Marcus Hutter, Andras Gyorgy, Jordi Grau-Moya
TL;DR
The paper addresses non-stationary stochastic bandits by reframing agent-environment interaction as universal source coding, distinguishing actions from observations to avoid self-delusion. It introduces ActivePTW, which combines KT-based per-arm estimators with Partition Tree Weighting to form a PTW-KTE environment and uses a Bayesian control-rule policy to sample actions. Theoretical results provide redundancy bounds and show the benefits of forced exploration, while experiments demonstrate that ActivePTW variants outperform several baselines across change-point regimes and often match Thompson Sampling in stationary settings. The work provides a principled, scalable route to universal control for non-stationary environments and suggests broader applicability of universal coding ideas to adaptive agents.
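To make the "KT-based per-arm estimator" concrete, the following is a minimal sketch of the standard Krichevsky-Trofimov (KT) estimator for a single Bernoulli arm. It is not the paper's implementation: the names `KTEstimator` and `kt_sequence_prob` are hypothetical and chosen only for illustration. The one assumption is the usual KT rule, which predicts the next reward as `(count + 1/2) / (n + 1)`.

```python
from dataclasses import dataclass


@dataclass
class KTEstimator:
    """Illustrative KT estimator for one Bernoulli arm (hypothetical name)."""
    zeros: int = 0
    ones: int = 0

    def prob(self, bit: int) -> float:
        # KT rule: add 1/2 to the count of the symbol being predicted.
        n = self.zeros + self.ones
        count = self.ones if bit == 1 else self.zeros
        return (count + 0.5) / (n + 1.0)

    def update(self, bit: int) -> None:
        # Record one observed reward (0 or 1).
        if bit == 1:
            self.ones += 1
        else:
            self.zeros += 1


def kt_sequence_prob(bits) -> float:
    """Probability the KT estimator assigns to a whole reward sequence."""
    est = KTEstimator()
    p = 1.0
    for b in bits:
        p *= est.prob(b)
        est.update(b)
    return p
```

With no data the estimator predicts each outcome with probability 1/2, and the probabilities it assigns to all sequences of a fixed length sum to one, which is what lets it serve as a coding distribution.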
Abstract
This paper considers a generalisation of universal source coding for interaction data, namely data streams that have actions interleaved with observations. Our goal is to construct a coding distribution that is both universal and usable as a control policy. Action generation needs careful treatment, as naive approaches that do not distinguish between actions and observations run into the self-delusion problem in universal settings. We showcase our perspective in the context of the challenging non-stationary stochastic Bernoulli bandit problem. Our main contribution is an efficient and high-performing algorithm for this problem that generalises the Partition Tree Weighting universal source coding technique for passive prediction to the control setting.
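For readers unfamiliar with Partition Tree Weighting in the passive prediction setting, the sketch below evaluates its standard recursive definition directly: at depth d, the mixture puts weight 1/2 on the base model applied to the whole segment and weight 1/2 on splitting the segment at position 2^(d-1) and recursing on each half. This is an illustration only, not the paper's algorithm. The function names `kt_prob` and `ptw_prob` are hypothetical, a KT estimator is assumed as the base model, and the naive recursion is far slower than an incremental implementation would be. It covers prediction only and does not address action selection.

```python
def kt_prob(bits) -> float:
    """KT probability of a binary sequence (assumed base model)."""
    zeros = ones = 0
    p = 1.0
    for b in bits:
        count = ones if b == 1 else zeros
        p *= (count + 0.5) / (zeros + ones + 1.0)
        if b == 1:
            ones += 1
        else:
            zeros += 1
    return p


def ptw_prob(bits, depth: int, base=kt_prob) -> float:
    """Naive evaluation of the depth-`depth` PTW mixture over `bits`."""
    if len(bits) > 2 ** depth:
        raise ValueError("sequence longer than 2**depth")
    if depth == 0 or len(bits) == 0:
        # No further splits possible; the empty sequence has probability 1.
        return base(bits)
    k = 2 ** (depth - 1)  # fixed split point for this level of the tree
    no_split = base(bits)
    split = ptw_prob(bits[:k], depth - 1, base) * ptw_prob(bits[k:], depth - 1, base)
    return 0.5 * no_split + 0.5 * split


# A sequence with one abrupt change: PTW assigns it more probability
# than a single KT estimator that assumes stationarity.
changing = [0] * 8 + [1] * 8
assert ptw_prob(changing, depth=4) > kt_prob(changing)
```

Because the split points are fixed by the binary tree rather than searched over, the mixture stays a proper distribution while still covering piecewise-stationary sources whose change points fall near tree boundaries.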
