Autoregressive Policy Optimization for Constrained Allocation Tasks

David Winkel; Niklas Strauß; Maximilian Bernhard; Zongyue Li; Thomas Seidl; Matthias Schubert

TL;DR

This paper proposes a new method for constrained allocation tasks based on an autoregressive process to sequentially sample allocations for each entity, and introduces a novel de-biasing mechanism to counter the initial bias caused by sequential sampling.

Abstract

Allocation tasks represent a class of problems where a limited amount of resources must be allocated to a set of entities at each time step. Prominent examples include portfolio optimization and the distribution of computational workloads across servers. Allocation tasks are typically bound by linear constraints describing practical requirements that have to be strictly fulfilled at all times. In portfolio optimization, for example, investors may be obligated to allocate less than 30% of the funds to a certain industrial sector in any investment period. Such constraints restrict the action space of allowed allocations in intricate ways, which makes it difficult to learn a policy that avoids constraint violations. In this paper, we propose a new method for constrained allocation tasks based on an autoregressive process that sequentially samples the allocation for each entity. In addition, we introduce a novel de-biasing mechanism to counter the initial bias caused by sequential sampling. We demonstrate the superior performance of our approach compared to a variety of Constrained Reinforcement Learning (CRL) methods on three distinct constrained allocation tasks: portfolio optimization, computational workload distribution, and a synthetic allocation benchmark. Our code is available at: https://github.com/niklasdbs/paspo
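To make the notion of a linearly constrained allocation concrete, the following is a minimal sketch of a feasibility check. It assumes an allocation is a non-negative vector that sums to 1 and that the constraints take the form $Ca \leq b$, as in Theorem 1. The function name `is_feasible`, the four-asset example, and the tolerance are illustrative assumptions and are not taken from the paper or its code; the check uses a non-strict inequality, whereas the abstract's example is phrased as "less than 30%".

```python
def is_feasible(a, C, b, tol=1e-9):
    """Return True if `a` is non-negative, sums to 1, and satisfies C a <= b.

    Hypothetical helper for illustration only; not part of the paper's code.
    """
    if any(x < -tol for x in a):
        return False
    if abs(sum(a) - 1.0) > tol:
        return False
    for row, bound in zip(C, b):
        # One linear constraint: dot(row, a) <= bound
        if sum(c * x for c, x in zip(row, a)) > bound + tol:
            return False
    return True


# Assumed example: four assets, the first two belong to one sector capped at 30%.
C = [[1.0, 1.0, 0.0, 0.0]]
b = [0.3]

ok = is_feasible([0.1, 0.2, 0.3, 0.4], C, b)   # sector share is 0.3
bad = is_feasible([0.3, 0.2, 0.2, 0.3], C, b)  # sector share is 0.5
```

Here `ok` evaluates to `True` and `bad` to `False`, since the second allocation places half of the funds in the capped sector.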

Paper Structure (22 sections, 1 theorem, 7 equations, 8 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Problem Description
PASPO
Autoregressive Polytope Decomposition
Parameterizable Policy Process
Policy Network Architecture
De-biasing Mechanism
Experiments
Experimental Setup
Performance of PASPO
Importance of de-biased Initialization and Order
Limitations and Future Work
Conclusion
Environments
...and 7 more sections

Key Result

Theorem 1

Let $P=\{a \in \mathbb{R}^n| Ca\leq b\}\neq \emptyset$ be the convex polytope that corresponds to a constrained action space. Let $A$ be the set of all the points that can be generated by PASPO. It holds that $A=P$.
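The idea of generating a feasible allocation one entity at a time can be sketched for a deliberately simple special case: a simplex with only per-entity upper caps. In that case the feasible interval for each $a_i$, given the earlier allocations, has a closed form. This is an illustrative assumption, not the paper's procedure: for general constraints $Ca \leq b$ the interval bounds would have to be obtained by other means (for example by optimization), and PASPO samples from a learned policy rather than uniformly. The name `sample_capped_allocation` and the cap values are hypothetical.

```python
import random


def sample_capped_allocation(caps, rng):
    """Sequentially sample `a` with sum(a) == 1 and 0 <= a[i] <= caps[i].

    Special case for illustration: only per-entity caps, uniform sampling.
    """
    if sum(caps) < 1.0:
        raise ValueError("caps must sum to at least 1, otherwise nothing is feasible")
    remaining = 1.0
    allocation = []
    for i in range(len(caps)):
        later_capacity = sum(caps[i + 1:])
        # Lower bound: later entities must still be able to absorb the rest.
        low = max(0.0, remaining - later_capacity)
        # Upper bound: own cap and the budget that is still unallocated.
        high = min(caps[i], remaining)
        a_i = rng.uniform(low, high)
        allocation.append(a_i)
        remaining -= a_i
    return allocation


rng = random.Random(0)
caps = [0.3, 0.5, 0.6]  # assumed example caps
sample = sample_capped_allocation(caps, rng)
```

In this special case every allocation that respects the caps lies inside the sampled interval at each step, and the last entity always receives exactly the remaining budget. This mirrors, in a much simpler setting, the coverage property that Theorem 1 states for PASPO on general polytopes.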

Figures (8)

Figure 1: Examples of 3-dimensional allocation action spaces (a) unconstrained and (b) constrained (valid solutions as red area).
Figure 2: Example of sampling process of an action $a=(a_1,a_2,a_3)$ in a 3-dimensional constrained allocation task.
Figure 3: The impact of initialization in an unconstrained simplex. (a) Mean allocations $a_i$ to each entity in a seven-entity setup when sampling each individual allocation from the uniform distribution (red) vs. our initialization (blue). (b,c) Distribution of 2500 allocations in a three-entity setup when sampling each individual allocation uniformly (b) or using beta distributions with parameters set according to our initialization (c).
Figure 4: Learning curves of all methods in three environments. The x-axis shows the number of environment steps. The y-axis shows the average episode reward (first row) and the number of constraint violations per epoch (second row). For portfolio optimization (b), we report the performance from running eight evaluations on 200 fixed market trajectories, because every trajectory is different in training, which makes comparisons hard. Curves are smoothed for visualization.
Figure 5: Ablations. (a) Performance of our approach with (blue) and without (orange) the de-biased initialization. (b) Impact of the allocation order, with the reversed allocation order shown in red.
...and 3 more figures
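The bias that Figure 3 illustrates can be reproduced with a short simulation in the unconstrained simplex. The sketch below compares sequential uniform sampling with a standard stick-breaking construction in which the $i$-th fraction of the remaining budget is drawn from $\mathrm{Beta}(1, n-i)$, which is known to yield a uniform distribution over the simplex. Whether these Beta parameters coincide with the paper's initialization is an assumption; the function names and the sample size are illustrative.

```python
import random


def sequential_uniform(n, rng):
    """Draw each share uniformly from the budget that is still unallocated."""
    remaining, a = 1.0, []
    for _ in range(n - 1):
        share = rng.uniform(0.0, remaining)
        a.append(share)
        remaining -= share
    a.append(remaining)  # last entity takes what is left
    return a


def sequential_beta(n, rng):
    """Stick-breaking with Beta(1, n - i) fractions (assumed de-biased variant)."""
    remaining, a = 1.0, []
    for i in range(1, n):
        fraction = rng.betavariate(1.0, float(n - i))
        share = fraction * remaining
        a.append(share)
        remaining -= share
    a.append(remaining)
    return a


def mean_allocation(sampler, n, num_samples, rng):
    """Average allocation per entity over `num_samples` draws."""
    totals = [0.0] * n
    for _ in range(num_samples):
        for i, x in enumerate(sampler(n, rng)):
            totals[i] += x
    return [t / num_samples for t in totals]


rng = random.Random(0)
biased = mean_allocation(sequential_uniform, 7, 20000, rng)
debiased = mean_allocation(sequential_beta, 7, 20000, rng)
```

Under sequential uniform sampling the first entity receives about half of the budget on average and later entities receive rapidly shrinking shares, whereas the Beta-based variant gives every entity a mean share close to $1/7$.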

Theorems & Definitions (2)

Theorem 1
proof
