Stochastic Optimal Control with Side Information and Bayesian Learning

Johannes Milz; Alexander Shapiro; Enlu Zhou

TL;DR

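This work proposes a Bayesian reformulation of infinite-horizon stochastic optimal control with observable side information. The reformulation rests on a parametric density model and posterior predictive dynamics, and it yields a Bayesian Bellman equation. The authors prove posterior consistency under Markov samples and, under correct specification and identifiability, uniform convergence of the Bayesian value function.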

Abstract

We study infinite-horizon stochastic optimal control problems with observable side information: a Markov chain that modulates an unknown context-conditional randomness distribution. Since this distribution is unknown, we propose a Bayesian reformulation based on a parametric density model and posterior predictive dynamics, which yields a Bayesian Bellman equation. We prove posterior consistency under Markov samples and, under correct specification and identifiability, uniform convergence of the Bayesian value function. Finally, we establish Bernstein--von Mises-type asymptotic normality for the data-driven contextual optimal value.
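To make the phrase "Bayesian Bellman equation" concrete, the following is a hedged sketch of what such an equation could look like, not the paper's exact statement. The symbols $F$, $\varpi$, $q$, $\mathsf{p}_N$, and $V_N^*$ follow the figure captions on this page; the stage cost $c$, the discount factor $\gamma \in (0,1)$, the control set $U$, the parametric family $q_\theta$ with parameter set $\Theta$, and the finiteness of the context space are assumptions introduced here for illustration.

```latex
% True-model Bellman equation (assumed discounted-cost form):
V^*(x,\eta) = \inf_{u \in U} \int \Big[ c(x,u,\xi)
  + \gamma \sum_{\eta'} \varpi_{\eta,\eta'}\, V^*\big(F(x,u,\xi),\eta'\big) \Big]\, q(\xi \mid \eta)\, d\xi

% Bayesian counterpart: q is replaced by the posterior predictive density
% obtained by averaging q_\theta over the posterior \mathsf{p}_N:
V_N^*(x,\eta) = \inf_{u \in U} \int_{\Theta} \int \Big[ c(x,u,\xi)
  + \gamma \sum_{\eta'} \varpi_{\eta,\eta'}\, V_N^*\big(F(x,u,\xi),\eta'\big) \Big]\,
  q_\theta(\xi \mid \eta)\, d\xi \; \mathsf{p}_N(d\theta)
```

Under this reading, posterior consistency means $\mathsf{p}_N$ concentrates around the true parameter, so the second equation approaches the first as $N$ grows.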

Paper Structure (12 sections, 7 theorems, 43 equations, 2 figures)

Introduction
Problem Statement
Markovian contextual dynamics and policy simplification.
Bayesian reformulation.
Consistency
Notation and terminology.
Consistency of Bayesian posterior with Markov samples
Consistency of value functions
Asymptotics of the contextual optimal value
Bernstein--von Mises Limits for Markov chains
Asymptotics of the contextual optimal value
Acknowledgments

Key Result

Lemma 3.1

Under parts (i), (v), and (vi) of the consistency assumption, a uniform law of large numbers (LLN) holds.

Figures (2)

Figure 1: Timeline of the data (context and randomness) process, system dynamics, and control processes. The context $\eta_t$ evolves according to the transition probability $\varpi_{\eta_t, \eta_{t+1}}$, and generates the randomness $\xi_t$ via $q(\cdot|\eta_t)$. The action $u_t = \pi(x_t, \eta_t)$ is chosen based on state and context, driving the system dynamics $x_{t+1}=F(x_t, u_t, \xi_t)$.
Figure 2: Schematic of the Bayesian learning and control pipeline given a dataset of size $N$. The accumulated historical data $\{(\xi_i, \eta_i)\}_{i=1}^N$ is used to construct the posterior $\mathsf{p}_N$, which defines the predictive expectation required to solve for the Bayesian value function $V_N^*$ and the corresponding optimal policy $\pi_N^*$. This process is repeated as new data is obtained.
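
The two captions above describe a data-generating timeline and a learn-then-control pipeline. A minimal Python sketch of that pipeline follows. It is an illustration under simplifying assumptions and not the authors' implementation: the randomness is Bernoulli with a context-dependent success probability, the parameter set is a three-point grid with a uniform prior, and the inventory-style dynamics `F`, the stage cost, and the discount factor are all hypothetical choices made here. The function names `simulate`, `posterior`, and `bayes_value_iteration` are likewise invented for this example.

```python
import numpy as np

def simulate(P, theta, N, rng):
    """Sample contexts eta_i from the Markov chain P, then xi_i ~ Bernoulli(theta[eta_i])."""
    eta = np.zeros(N, dtype=int)
    for i in range(1, N):
        eta[i] = rng.choice(len(P), p=P[eta[i - 1]])
    xi = (rng.random(N) < theta[eta]).astype(int)
    return xi, eta

def posterior(xi, eta, grid, prior):
    """Posterior weights over a finite grid; each row of grid is a candidate theta."""
    p = grid[:, eta]                                   # shape (len(grid), N)
    loglik = np.sum(np.where(xi == 1, np.log(p), np.log1p(-p)), axis=1)
    logpost = np.log(prior) + loglik
    w = np.exp(logpost - logpost.max())                # stabilize before normalizing
    return w / w.sum()

def bayes_value_iteration(P, pred1, S=5, gamma=0.9, tol=1e-10):
    """Value iteration using the posterior predictive pred1[k] = P(xi = 1 | eta = k)."""
    K = len(P)
    V = np.zeros((S, K))
    while True:
        Q = np.zeros((S, K, 2))
        for x in range(S):
            for k in range(K):
                for u in (0, 1):
                    for xi, prob in ((0, 1 - pred1[k]), (1, pred1[k])):
                        # Assumed dynamics F(x, u, xi): stock plus order minus demand, clipped.
                        x_next = min(max(x + u - xi, 0), S - 1)
                        # Assumed stage cost: ordering, holding, and shortage terms.
                        cost = 0.5 * u + 0.1 * x + 2.0 * max(xi - x - u, 0)
                        # Expectation over the next context uses row k of P.
                        Q[x, k, u] += prob * (cost + gamma * P[k] @ V[x_next])
        V_new = Q.min(axis=2)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=2)
        V = V_new

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])                  # context transition matrix
theta_true = np.array([0.2, 0.7])                       # true P(xi = 1 | eta)
grid = np.array([[0.2, 0.7], [0.5, 0.5], [0.7, 0.2]])   # candidate parameters

xi, eta = simulate(P, theta_true, 500, rng)
post = posterior(xi, eta, grid, np.full(3, 1 / 3))
pred1 = post @ grid                                     # posterior predictive per context
V_N, pi_N = bayes_value_iteration(P, pred1)
```

Here `post` plays the role of $\mathsf{p}_N$, while `V_N` and `pi_N` stand in for $V_N^*$ and $\pi_N^*$. Rerunning the last four lines with a larger dataset mimics the "repeated as new data is obtained" step in Figure 2.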

Theorems & Definitions (19)

Remark 1
Remark 2
Lemma 3.1
proof
Lemma 3.2: Exponential decay away from $\Theta^*$
proof
Theorem 3.1: Posterior consistency
proof
Definition 3.1
Proposition 3.1
...and 9 more
