Contextual Restless Multi-Armed Bandits with Application to Demand Response Decision-Making

Xin Chen; I-Hong Hou

TL;DR

This paper introduces Contextual Restless Bandits (CRB), a multi-armed bandit framework that models both the internal state transitions of each arm and the influence of external global environmental contexts, and applies it to complex online decision-making in smart grids.

Abstract

This paper introduces a novel multi-armed bandit framework, termed Contextual Restless Bandits (CRB), for complex online decision-making. The CRB framework combines the core features of contextual bandits and restless bandits, so that it can model both the internal state transitions of each arm and the influence of external global environmental contexts. Using the dual decomposition method, we develop a scalable index policy algorithm for solving the CRB problem and theoretically analyze its asymptotic optimality. For the case in which the arm models are unknown, we further propose a model-based online learning algorithm, built on the index policy, that learns the arm models and makes decisions simultaneously. Furthermore, we apply the proposed CRB framework and the index policy algorithm to the demand response decision-making problem in smart grids. Numerical simulations demonstrate the performance and efficiency of the proposed CRB approaches.
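
To make the dual decomposition idea in the abstract concrete, the following is a minimal sketch on a toy problem, not the paper's algorithm. It assumes a per-context activation budget in discounted form, two actions per arm (passive 0, active 1), and a global context that evolves as its own Markov chain; all function names, array layouts, and parameter values are hypothetical. Pricing each activation with a multiplier $\lambda_g$ decouples the arms, so each arm solves a small penalized MDP, and the multipliers are adjusted by projected subgradient steps.

```python
import numpy as np

def solve_arm(P, R, Pg, lam, beta=0.9, iters=200):
    """Value iteration for one arm's penalized MDP over the joint state (g, s).

    P[a, g, s, s2]: arm transition, R[a, g, s]: reward, Pg[g, g2]: context
    transition, lam[g]: price charged for activating (a = 1) in context g.
    All of these layouts are assumptions made for this sketch.
    """
    A, G, S, _ = P.shape
    price = lam[None, :, None] * np.arange(A)[:, None, None]  # shape (A, G, 1)
    V = np.zeros((G, S))
    for _ in range(iters):
        EV = Pg @ V  # EV[g, s2] = sum_g2 Pg[g, g2] * V[g2, s2]
        Q = R - price + beta * np.einsum("agst,gt->ags", P, EV)
        V = Q.max(axis=0)
    return Q.argmax(axis=0)  # policy[g, s] in {0, ..., A-1}

def activation_by_context(policy, P, Pg, mu0, beta=0.9, horizon=200):
    """Expected discounted activation per context under a fixed policy."""
    G, S = policy.shape
    # Ppol[g, s, s2]: arm transition when following policy[g, s].
    Ppol = P[policy, np.arange(G)[:, None], np.arange(S)[None, :]]
    mu, use = mu0.copy(), np.zeros(G)
    for t in range(horizon):
        use += beta**t * (mu * policy).sum(axis=1)
        moved = np.einsum("gs,gst->gt", mu, Ppol)  # arm state moves first
        mu = Pg.T @ moved                           # then the context moves
    return use

def dual_ascent(arms, Pg, mu0s, budget, beta=0.9, steps=30, lr=0.05):
    """Projected subgradient updates on the multipliers lam[g] >= 0."""
    lam = np.zeros(Pg.shape[0])
    for k in range(steps):
        total = np.zeros_like(lam)
        for (P, R), mu0 in zip(arms, mu0s):
            pol = solve_arm(P, R, Pg, lam, beta)
            total += activation_by_context(pol, P, Pg, mu0, beta)
        # Raise the price where the budget is exceeded, lower it otherwise.
        lam = np.maximum(0.0, lam + lr / np.sqrt(k + 1) * (total - budget))
    return lam

def demo():
    rng = np.random.default_rng(0)
    A, G, S, N = 2, 2, 3, 4  # toy sizes, chosen arbitrarily
    arms = []
    for _ in range(N):
        P = rng.random((A, G, S, S))
        P /= P.sum(axis=-1, keepdims=True)
        arms.append((P, rng.random((A, G, S))))
    Pg = np.array([[0.8, 0.2], [0.3, 0.7]])
    mu0 = np.full((G, S), 1.0 / (G * S))
    # Discounted occupancy of each context (activating everywhere counts visits).
    occ = activation_by_context(np.ones((G, S), dtype=int), arms[0][0], Pg, mu0)
    lam = dual_ascent(arms, Pg, [mu0] * N, budget=0.25 * N * occ)
    print("multipliers per context:", lam)

if __name__ == "__main__":
    demo()
```

Because each arm's best response is a deterministic policy, the activation totals change in jumps and the multipliers may oscillate around their limit; the diminishing step size is one common way to damp this, and nothing here reproduces the paper's convergence results.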

Paper Structure (17 sections, 2 theorems, 29 equations, 3 figures, 3 algorithms)

Introduction
Problem Formulation
Contextual Restless Bandits (CRB)
Application of CRB to Demand Response
Index Policy Design via Dual Decomposition
Dual Decomposition of Primal Problem
Solution of Sub-Problems and Expectation Computation
Index Policy for Primal Problem with Known Models
Online CRB Learning with Unknown Models
Index Policy for Primal
Asymptotic Optimality Analysis
Numerical Simulations
Convergence of Dual Decomposition
Asymptotic Optimality of Index Policy
Performance Comparison with Restless Bandits
...and 2 more sections

Key Result

Lemma 1

Suppose that the initial global context is $g_0 = g$ and that the initial state $s_{i,0}$ of each arm $i\in [N]$ is drawn independently from the distribution $\mathbb{P}(s_{i,0}=s) = m^*_g(s)$. Then, under the policy $\pi^*_{\mathrm{Rel}}$, we have

Figures (3)

Figure 1: Convergence of the Lagrange multiplier $\bm{\lambda} := (\lambda_g)_{g\in\mathcal{G}}$ under the dual decomposition method.
Figure 2: Comparison between the per-user rewards $V^N_{\mathrm{Rel}}/N$ of the relaxed problem and $V^N_{\mathrm{Ind}}/N$ of the index policy algorithm.
Figure 3: Comparison of the total discounted rewards between the CRB method and the traditional restless bandits method.

Theorems & Definitions (2)

Lemma 1
Theorem 1
