Predictive Control and Regret Analysis of Non-Stationary MDP with Look-ahead Information

Ziyi Zhang, Yorie Nakahira, Guannan Qu

TL;DR

This paper proposes an algorithm designed to achieve low regret in non-stationary MDPs by incorporating look-ahead predictions, and demonstrates that, under certain assumptions, the regret decreases exponentially as the look-ahead window expands.

Abstract

Policy design in non-stationary Markov Decision Processes (MDPs) is inherently challenging because time-varying system transitions and rewards make it difficult for learners to determine the optimal actions for maximizing cumulative future rewards. Fortunately, in many practical applications, such as energy systems, look-ahead predictions are available, including forecasts for renewable energy generation and demand. In this paper, we leverage these look-ahead predictions and propose an algorithm designed to achieve low regret in non-stationary MDPs. Our theoretical analysis demonstrates that, under certain assumptions, the regret decreases exponentially as the look-ahead window expands. When the system prediction is subject to error, the regret does not explode even if the prediction error grows sub-exponentially as a function of the prediction horizon. We validate our approach through simulations, confirming the efficacy of our algorithm in non-stationary environments.
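
To make the idea of planning with a look-ahead window concrete, the following is a minimal sketch of a generic receding-horizon planner for a tabular non-stationary MDP. It is not taken from the paper and should not be read as the authors' algorithm: the function names (`lookahead_action`, `run_receding_horizon`, `toy_problem`), the zero terminal value at the end of the window, and the assumption of exact (error-free) predictions are all illustrative choices made here.

```python
import numpy as np


def lookahead_action(P_pred, r_pred, state):
    """Return the first action of a k-step plan computed by backward induction.

    P_pred: list of k arrays of shape (S, A, S), predicted transition kernels.
    r_pred: list of k arrays of shape (S, A), predicted rewards.
    Assumption of this sketch: the value beyond the window is zero.
    """
    value = np.zeros(r_pred[0].shape[0])
    q = None
    # Walk the window backwards; after the loop, q holds the Q-values
    # of the first step in the window.
    for P, r in zip(reversed(P_pred), reversed(r_pred)):
        q = r + P @ value          # (S, A): reward plus expected value-to-go
        value = q.max(axis=1)
    return int(np.argmax(q[state]))


def run_receding_horizon(P_seq, r_seq, k, start_state=0, seed=0):
    """Replan at every step using the next k predicted steps; return total reward."""
    rng = np.random.default_rng(seed)
    horizon = len(r_seq)
    state, total = start_state, 0.0
    for t in range(horizon):
        end = min(t + k, horizon)  # the window shrinks near the end
        action = lookahead_action(P_seq[t:end], r_seq[t:end], state)
        total += r_seq[t][state, action]
        probs = P_seq[t][state, action]
        state = int(rng.choice(len(probs), p=probs))
    return total


def toy_problem(horizon=4):
    """Hypothetical 2-state example: action 0 stays, action 1 switches state."""
    P = np.zeros((2, 2, 2))
    for s in range(2):
        P[s, 0, s] = 1.0       # stay
        P[s, 1, 1 - s] = 1.0   # switch
    P_seq, r_seq = [], []
    for t in range(horizon):
        r = np.zeros((2, 2))
        r[0, 0] = 1.0          # staying in state 0 pays a constant reward
        r[1, 0] = 2.0 + t      # staying in state 1 pays a growing reward
        P_seq.append(P)
        r_seq.append(r)
    return P_seq, r_seq
```

In this toy problem, a one-step window (`k=1`) never leaves state 0 and collects a total reward of 4, whereas a window covering the whole horizon (`k=4`) switches immediately and collects 12. This only illustrates why look-ahead information can help; it says nothing about the regret rates established in the paper.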

Paper Structure

This paper contains 20 sections, 15 theorems, 59 equations, 4 figures, and 1 algorithm.

Key Result

Proposition 3.1

For any non-stationary MDP satisfying the assumption labeled assumption:strong_connect, the operator $L_t$ defined in the equation labeled defn:bellman is a $J$-stage contraction operator with contraction coefficient

Figures (4)

  • Figure 1: Panels fig:simulation_noiseless, fig:simulation_a, and fig:simulation_a2 show the regret of MPDP under different additive prediction errors $\mathcal{N}(0,\sigma)$. The red solid line shows the mean regret, and the shaded area shows the confidence interval.
  • Figure 2: Panel fig:simulation_b shows that the average regret remains almost constant while the variance of the prediction error is below 5 and starts to grow afterward. Panel fig:simulation_c shows that the regret increases slowly with the growth rate of the prediction-error variance with respect to the prediction horizon.
  • Figure 3: Panels fig:context, fig:context1, and fig:context2 show the regret of MPDP under different additive prediction errors $\mathcal{N}(0,\sigma)$ in a setting where the MDP changes a finite number of times. The red solid line shows the mean regret, and the shaded area shows the confidence interval.
  • Figure 4: Panel fig:simulation_d shows the power usage at different time steps. The shaded area indicates the time period with energy prices below 8. With $k \geq 7$, most of the energy usage occurs within the period of low energy prices, reducing the total energy cost of the station. Panel fig:simulation_e demonstrates that the regret of EV charging decays with the prediction horizon. Compared with traditional scheduling policies, the proposed algorithm can lower the total energy cost even with only a few prediction steps.

Theorems & Definitions (31)

  • Example 1
  • Example 2
  • Definition 1
  • Definition 2: $J$-stage contraction
  • Proposition 3.1
  • Definition 3
  • Proposition 3.2
  • Proposition 3.3
  • Theorem 4.1
  • Corollary 4.2
  • ...and 21 more