Stochastic Optimal Control with Side Information and Bayesian Learning

Johannes Milz, Alexander Shapiro, Enlu Zhou

TL;DR

This work proposes a Bayesian reformulation based on a parametric density model and posterior predictive dynamics, which yields a Bayesian Bellman equation, and proves posterior consistency under Markov samples and uniform convergence of the Bayesian value function.

Abstract

We study infinite-horizon stochastic optimal control problems with observable side information: a Markov chain that modulates an unknown context-conditional randomness distribution. Since this distribution is unknown, we propose a Bayesian reformulation based on a parametric density model and posterior predictive dynamics, which yields a Bayesian Bellman equation. We prove posterior consistency under Markov samples and, under correct specification and identifiability, uniform convergence of the Bayesian value function. Finally, we establish Bernstein--von Mises-type asymptotic normality for the data-driven contextual optimal value.
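For orientation, a Bayesian Bellman equation of the kind described above might take the following form. This is a sketch under stated assumptions, not the paper's exact statement: the stage cost $c$, discount factor $\gamma \in (0,1)$, control set $\mathcal{U}$, parameter set $\Theta$, and parametric density $q_\theta(\cdot \mid \eta)$ are notation introduced here for illustration, and the context transition probabilities $\varpi_{\eta,\eta'}$ are assumed known. The symbols $F$, $\mathsf{p}_N$, and $V_N$ follow the figure captions.

```latex
V_N(x,\eta)
  = \inf_{u \in \mathcal{U}}
    \int_{\Theta}
      \mathbb{E}_{\xi \sim q_\theta(\cdot \mid \eta)}
      \Big[ c(x,u,\xi)
            + \gamma \sum_{\eta'} \varpi_{\eta,\eta'}\,
              V_N\big(F(x,u,\xi),\eta'\big) \Big]
    \,\mathsf{p}_N(\mathrm{d}\theta)
```

Here the inner expectation and the integral against the posterior $\mathsf{p}_N$ together form the posterior predictive expectation of $\xi$ given the current context $\eta$.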

Paper Structure

This paper contains 12 sections, 7 theorems, 43 equations, 2 figures.

Key Result

Lemma 3.1

Under parts (i), (v), and (vi) of the consistency assumption, the following uniform law of large numbers (LLN) holds:

Figures (2)

  • Figure 1: Timeline of the data (context and randomness) process, system dynamics, and control processes. The context $\eta_t$ evolves according to the transition probability $\varpi_{\eta_t, \eta_{t+1}}$, and generates the randomness $\xi_t$ via $q(\cdot|\eta_t)$. The action $u_t = \pi(x_t, \eta_t)$ is chosen based on state and context, driving the system dynamics $x_{t+1}=F(x_t, u_t, \xi_t)$.
  • Figure 2: Schematic of the Bayesian learning and control pipeline given a dataset of size $N$. The accumulated historical data $\{(\xi_i, \eta_i)\}_{i=1}^N$ is used to construct the posterior $\mathsf{p}_N$, which defines the predictive expectation required to solve for the Bayesian value function $V_N^*$ and the corresponding optimal policy $\pi_N^*$. This process is repeated as new data is obtained.
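
The two captions above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' implementation. Every concrete choice is an assumption made for illustration: two contexts, a Gaussian $q(\cdot|\eta)$ with unknown mean and unit variance, a conjugate normal prior, a linear $F$, and a simple policy. It generates a context path and randomness as in Figure 1, forms a posterior from the data as in Figure 2, and rolls the dynamics forward under a policy that uses the posterior mean.

```python
import numpy as np

# All concrete choices below (two contexts, Gaussian q, linear F, the policy)
# are illustrative assumptions, not taken from the paper.
P = np.array([[0.9, 0.1], [0.2, 0.8]])   # context transition probabilities
THETA_TRUE = np.array([-1.0, 2.0])       # unknown mean of xi in each context


def simulate_data(n, rng):
    """Draw a context path eta_1..eta_n and xi_i ~ N(THETA_TRUE[eta_i], 1)."""
    etas = np.empty(n, dtype=int)
    eta = 0
    for i in range(n):
        etas[i] = eta
        eta = rng.choice(2, p=P[eta])    # Markov transition of the context
    xis = THETA_TRUE[etas] + rng.standard_normal(n)
    return etas, xis


def posterior(etas, xis, prior_mean=0.0, prior_var=10.0):
    """Conjugate normal posterior over each context mean (unit noise variance)."""
    means, variances = np.empty(2), np.empty(2)
    for k in range(2):
        x = xis[etas == k]
        precision = 1.0 / prior_var + x.size
        variances[k] = 1.0 / precision
        means[k] = (prior_mean / prior_var + x.sum()) * variances[k]
    return means, variances


def rollout(policy, etas, xis, x0=0.0):
    """Apply x_{t+1} = F(x_t, u_t, xi_t) with a hypothetical linear F."""
    x, path = x0, [x0]
    for eta, xi in zip(etas, xis):
        u = policy(x, eta)               # u_t = pi(x_t, eta_t)
        x = 0.5 * x + u + xi             # assumed F(x, u, xi)
        path.append(x)
    return np.array(path)


rng = np.random.default_rng(0)
etas, xis = simulate_data(5000, rng)
post_mean, post_var = posterior(etas, xis)
# Illustrative policy: offset the posterior predictive mean of xi in each context.
path = rollout(lambda x, eta: -post_mean[eta], etas[:50], xis[:50])
```

In this conjugate setting the posterior predictive mean of $\xi$ given context $k$ is `post_mean[k]`, so the policy above is one simple way to use the learned model. Solving for the Bayesian value function $V_N^*$ itself would additionally require a dynamic programming step, which this sketch omits.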

Theorems & Definitions (19)

  • Remark 1
  • Remark 2
  • Lemma 3.1
  • proof
  • Lemma 3.2: Exponential decay away from $\Theta^*$
  • proof
  • Theorem 3.1: Posterior consistency
  • proof
  • Definition 3.1
  • Proposition 3.1
  • ...and 9 more