Adaptive Optimization for Stochastic Renewal Systems

Michael J. Neely

TL;DR

A new algorithm is developed that is adaptive and comes within $O(ε)$ of optimality for any interval of $Θ(1/ε^2)$ tasks over which probabilities are held fixed.

Abstract

This paper considers online optimization for a system that performs a sequence of back-to-back tasks. Each task can be processed in one of multiple processing modes that affect the duration of the task, the reward earned, and an additional vector of penalties (such as energy or cost). Let $A[k]$ be a random matrix of parameters that specifies the duration, reward, and penalty vector under each processing option for task $k$. The goal is to observe $A[k]$ at the start of each new task $k$ and then choose a processing mode for the task so that, over time, time average reward is maximized subject to time average penalty constraints. This is a \emph{renewal optimization problem} and is challenging because the probability distribution for the $A[k]$ sequence is unknown. Prior work shows that any algorithm that comes within $ε$ of optimality must have $Ω(1/ε^2)$ convergence time. The only known algorithm that can meet this bound operates without time average penalty constraints and uses a diminishing stepsize that cannot adapt when probabilities change. This paper develops a new algorithm that is adaptive and comes within $O(ε)$ of optimality for any interval of $Θ(1/ε^2)$ tasks over which probabilities are held fixed, regardless of probabilities before the start of the interval.
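To make the setup concrete, the following is a minimal simulation sketch of the task model described in the abstract. It is not the algorithm developed in the paper. The distribution inside `sample_task`, the use of a single scalar penalty, and the greedy rule (pick the mode with the largest reward-to-duration ratio, ignoring penalties) are all assumptions chosen for illustration. The point it shows is that performance is measured as a ratio of sums over tasks: total reward divided by total elapsed time.

```python
import random


def sample_task(rng, num_modes=3):
    """Draw a hypothetical A[k]: one (duration, reward, penalty) row per mode.

    The uniform ranges are illustrative assumptions, not taken from the paper.
    """
    return [
        (rng.uniform(1.0, 2.0), rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0))
        for _ in range(num_modes)
    ]


def greedy_mode(rows):
    """Pick the row with the largest reward/duration ratio (illustrative baseline)."""
    return max(rows, key=lambda row: row[1] / row[0])


def run(num_tasks, seed=0):
    """Process tasks back to back; return (reward per unit time, penalty per unit time)."""
    rng = random.Random(seed)
    total_t = total_r = total_y = 0.0
    for _ in range(num_tasks):
        rows = sample_task(rng)          # observe A[k] at the start of task k
        t, r, y = greedy_mode(rows)      # choose one processing mode (one row)
        total_t += t
        total_r += r
        total_y += y
    # Time averages are ratios of sums, not averages of per-task ratios.
    return total_r / total_t, total_y / total_t


if __name__ == "__main__":
    reward_rate, penalty_rate = run(10_000)
    print(f"reward per unit time:  {reward_rate:.4f}")
    print(f"penalty per unit time: {penalty_rate:.4f}")
```

A rule like this has no mechanism for enforcing a time average penalty constraint, which is one reason the constrained problem in the abstract needs more than a per-task greedy choice.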

Paper Structure (32 sections, 11 theorems, 147 equations, 9 figures)

Key Result

Lemma 1

Suppose $\{A[k]\}_{k=1}^{\infty}$ are i.i.d. and satisfy the boundedness assumptions eq:bound1-eq:bound4. Then a) For every $(t,r,y) \in \Gamma$ and $k \in \{1, 2, 3, \ldots\}$, there exists a decision vector $(T^*[k], R^*[k], Y^*[k]) \in Row(A[k])$ that is independent of $H[k]$ and that satisfies ( b) If $\{(T[k], R[k], Y[k])\}_{k=1}^{\infty}$ is a sequence of decision vectors from a causal decis

Figures (9)

  • Figure 1: Four sequential tasks in the timeline. Vertical arrows for each task $k$ represent values for reward $R[k]$ and penalty $Y[k]$. In this example, green is reward (profit), red is energy, blue is quality. The duration of task $k$ and the height of its arrows depend on choices made at the start of task $k$.
  • Figure 2: System 1: Accumulated reward per unit time for the proposed adaptive algorithm (with $v\in\{1,2,10\}$), the vanishing-stepsize Robbins-Monro algorithm, and the greedy algorithm. All data points are averaged over $40$ independent simulations.
  • Figure 3: System 1: Testing adaptation over a simulation of $2\times 10^4$ tasks with a distributional change introduced at the halfway point (task $10^4$). The two horizontal dashed lines represent optimal $\theta^*$ values for the two distributions. Each point for task $k_0$ is the result of a moving window average $\frac{\sum_{k=1}^{200}\mathbb{E}\left[R[k_0-k]\right]}{\sum_{k=1}^{200}\mathbb{E}\left[T[k_0-k]\right]}$, where expectations are obtained by averaging over $40$ independent simulations. The adaptive algorithm (with $v=10$) quickly adapts to the change. The Robbins-Monro algorithm takes a long time to adapt.
  • Figure 4: Time average reward up to task $k$ for the adaptive algorithm ($v\in\{10,50,200\}$); the DPP algorithm with ratio averaging; the greedy algorithm.
  • Figure 5: Corresponding time averaged power for the simulations of Fig. 4. The horizontal asymptote is $p_{av}=1/3$.
  • ...and 4 more figures
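
The moving window average in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption's formula, not code from the paper. The function names are hypothetical, and the sketch assumes task indices coincide with 0-based list indices, so the window for task $k_0$ covers indices $k_0-200$ through $k_0-1$. Expectations are approximated by averaging each task's value across independent simulation runs before taking the ratio.

```python
def average_over_runs(runs):
    """Average per-task values across independent runs.

    `runs` is a list of equal-length lists, one list per simulation run.
    """
    return [sum(column) / len(column) for column in zip(*runs)]


def windowed_ratio(mean_rewards, mean_durations, k0, window=200):
    """Ratio of summed rewards to summed durations over tasks k0-window .. k0-1."""
    if k0 < window:
        raise ValueError("k0 must be at least the window length")
    lo = k0 - window
    return sum(mean_rewards[lo:k0]) / sum(mean_durations[lo:k0])


if __name__ == "__main__":
    # Two tiny hypothetical runs of four tasks each.
    reward_runs = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]
    duration_runs = [[2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0]]
    mean_r = average_over_runs(reward_runs)
    mean_t = average_over_runs(duration_runs)
    print(windowed_ratio(mean_r, mean_t, k0=4, window=2))
```

Taking the ratio of the two window sums, rather than averaging per-task ratios, matches the form of the expression in the caption.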

Theorems & Definitions (22)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • ...and 12 more