A Novel Switch-Type Policy Network for Resource Allocation Problems: Technical Report

Jerrod Wigmore, Brooke Shrader, Eytan Modiano

TL;DR

This work tackles inefficiencies and poor generalization of DRL policies for queueing networks by introducing a switch-type neural network (STN) that enforces structure inspired by classical switch-type policies. By employing monotonic hidden layers with exponentiated weights and per-component input processing, the STN yields a stochastic switch-type policy that improves sample efficiency and generalization when trained with PPO. Empirical results show STN matches MLP performance on familiar environments and significantly outperforms it on unseen ones, with strong zero-shot generalization and multi-environment training efficiency. The findings suggest that the switch-type policy class is effective for a broad class of queueing network control problems and can enhance practical deployment of DRL in dynamic network control contexts.

Abstract

Deep Reinforcement Learning (DRL) has become a powerful tool for developing control policies in queueing networks, but the common use of Multi-layer Perceptron (MLP) neural networks in these applications has significant drawbacks. MLP architectures, while versatile, often suffer from poor sample efficiency and a tendency to overfit training environments, leading to suboptimal performance on new, unseen networks. In response to these issues, we introduce a switch-type neural network (STN) architecture designed to improve the efficiency and generalization of DRL policies in queueing networks. The STN leverages structural patterns from traditional non-learning policies, ensuring consistent action choices across similar states. This design not only streamlines the learning process but also fosters better generalization by reducing the tendency to overfit. Our work presents three key contributions: first, the development of the STN as a more effective alternative to MLPs; second, empirical evidence showing that STNs achieve superior sample efficiency in various training scenarios; and third, experimental results demonstrating that STNs match MLP performance in familiar environments and significantly outperform them in new settings. By embedding domain-specific knowledge, the STN enhances the Proximal Policy Optimization (PPO) algorithm's effectiveness without compromising performance, suggesting its suitability for a wide range of queueing network control problems.

Paper Structure (27 sections, 1 theorem, 15 equations, 7 figures, 3 tables)

This paper contains 27 sections, 1 theorem, 15 equations, 7 figures, 3 tables.

Key Result

Lemma 1

For an MDP with state-space $\mathcal{S} = \mathcal{S}_1\times\dots\times \mathcal{S}_K$, let the function $\mathbf f:\mathcal{S}\mapsto \mathbb R^{K}$ be decomposable such that $\mathbf f(\mathbf s)=(f(\mathbf s_1), \dots, f(\mathbf s_K))$ for some function $f$. If for all $k\in[K]$, $f:\mathcal{S}_k\mapsto \mathbb R$ [...] is a stochastic switch-type policy.
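
The construction in the lemma, combined with the TL;DR's description of monotonic layers with exponentiated weights and per-component input processing, can be illustrated with a minimal sketch. This is not the authors' implementation: the softmax output, the tanh activation, the layer sizes, the parameter values, the two-feature component layout (queue length, service state), and the choice of an increasing rather than decreasing scorer are all assumptions made for illustration.

```python
import math


def monotone_score(x, raw_w1, b1, raw_w2, b2):
    """Shared scalar scorer f applied to one component's feature vector x.

    Weights are stored unconstrained and exponentiated before use, so every
    effective weight is positive. With the increasing tanh activation
    (an assumption), f is non-decreasing in each input feature.
    """
    hidden = [
        math.tanh(sum(math.exp(w) * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(raw_w1, b1)
    ]
    return sum(math.exp(w) * h for w, h in zip(raw_w2, hidden)) + b2


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stn_policy(state, params):
    """Apply the same scorer to each of the K components, then normalize.

    `state` is a list of K per-component feature vectors (s_1, ..., s_K).
    The softmax output layer is an assumption of this sketch.
    """
    scores = [monotone_score(x, *params) for x in state]
    return softmax(scores)


# Hypothetical parameters: 2 input features per component, 3 hidden units.
params = (
    [[-1.0, 0.5], [0.2, -0.3], [-2.0, 0.0]],  # raw_w1, exponentiated in use
    [0.0, -0.5, 0.1],                         # b1
    [0.3, -0.7, 0.1],                         # raw_w2, exponentiated in use
    0.0,                                      # b2
)

# Increase only the first component's first feature; all else is unchanged.
p_before = stn_policy([[3.0, 1.0], [5.0, 2.0]], params)
p_after = stn_policy([[4.0, 1.0], [5.0, 2.0]], params)
```

Because the same scorer is shared across components and is monotone in its inputs, raising one component's features while holding the others fixed can only raise that component's selection probability. This is the kind of consistency across similar states that the abstract attributes to the STN.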

Figures (7)

  • Figure 1: Example of a single-hop scheduling environment. Packets arrive to each queue according to their independent arrival distributions. The service rate between each queue and the server varies according to an independent i.i.d. process. The server selects one of the $K$ queues to serve in each time-step.
  • Figure 2: Example of a multi-path routing environment. A single packet enters the network at the routing node labeled $v_R$. The policy determines which of the $K$ queues the packet is then routed to. Each server maintains its own separate queue. After a packet is processed by any server, it is sent to node $v_D$ where it immediately leaves the network.
  • Figure 3: Decision regions for the policy $\pi_{PI}$. Blue denotes $\pi_{PI}(\mathbf s)=1$ and red denotes $\pi_{PI}(\mathbf s)=2$. The x-axis corresponds to $q_1$ and the y-axis corresponds to $q_2$. Each plot corresponds to a different set of service states $\mathbf y = (y_1, y_2)$. Together, the plots show the decision regions over the truncated state-space with $(q_1, q_2)\in[0, 20]^2$ and $(y_1,y_2)\in[1,2]^2$.
  • Figure 4: Moving average cost of the training policy versus training step. The left column corresponds to the single-hop scheduling training environments and the right column corresponds to the multi-path routing training environments.
  • Figure 5: Average cost of the trained policies in the same environments they were trained in.
  • ...and 2 more figures
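
The single-hop scheduling environment described in Figure 1 can be sketched as one simulation step. The details below are assumptions for illustration only and are not taken from the paper: Bernoulli arrivals, service states drawn uniformly from {1, 2}, service applied before arrivals, and a cost equal to the total queue backlog. The function name `single_hop_step` and the scheduling rule used to pick the action are likewise hypothetical.

```python
import random


def single_hop_step(queues, service, action, arrival_probs, rng):
    """One time-step of a K-queue, single-server scheduling model.

    The server serves only the chosen queue, at that queue's current
    service rate. Arrivals are then admitted independently per queue
    and fresh i.i.d. service states are drawn (assumed ordering).
    """
    next_queues = list(queues)
    # Serve the selected queue; a queue cannot go below empty.
    next_queues[action] = max(0, next_queues[action] - service[action])
    # Independent Bernoulli arrivals (an assumed arrival distribution).
    for k, p in enumerate(arrival_probs):
        if rng.random() < p:
            next_queues[k] += 1
    # Service states redrawn i.i.d. from {1, 2} (an assumed distribution).
    next_service = [rng.choice((1, 2)) for _ in queues]
    # Assumed cost: total backlog after the step.
    cost = sum(next_queues)
    return next_queues, next_service, cost


rng = random.Random(0)
queues, service = [3, 5], [1, 2]

# Illustrative rule only: serve the queue with the largest
# queue-length-times-service-rate product.
action = max(range(len(queues)), key=lambda k: queues[k] * service[k])

next_queues, next_service, cost = single_hop_step(
    queues, service, action, [0.3, 0.4], rng
)
```

A learned policy would replace the hand-written rule that selects `action`; the state it observes here is the pair of queue lengths and service states, matching the axes and panels of Figure 3.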

Theorems & Definitions (4)

  • Definition 2.1
  • Definition 3.1
  • Definition 3.2
  • Lemma 1