Structured Reinforcement Learning for Combinatorial Decision-Making

Heiko Hoppe, Léo Baty, Louis Bouvier, Axel Parmentier, Maximilian Schiffer

TL;DR

This work tackles reinforcement learning in combinatorial decision problems by embedding a combinatorial optimization layer (CO-layer) into the actor and enabling end-to-end training through Fenchel-Young losses. It offers a geometric interpretation as a sampling-based primal-dual method in the dual of the moment polytope and pairs this with a stable TD-based critic, including double Q-learning. Across six environments (static and dynamic), SRL matches or exceeds Structured Imitation Learning and unstructured PPO, achieving up to a 92% improvement on dynamic tasks while exhibiting lower variance and faster convergence, at the cost of additional computation driven by the CO-layer. The approach is well suited to industrial-scale planning with large combinatorial action spaces, where structure and end-to-end learning can yield substantial practical gains.

Abstract

Reinforcement learning (RL) is increasingly applied to real-world problems involving complex and structured decisions, such as routing, scheduling, and assortment planning. These settings challenge standard RL algorithms, which struggle to scale, generalize, and exploit structure in the presence of combinatorial action spaces. We propose Structured Reinforcement Learning (SRL), a novel actor-critic paradigm that embeds combinatorial optimization layers into the actor neural network. We enable end-to-end learning of the actor via Fenchel-Young losses and provide a geometric interpretation of SRL as a primal-dual algorithm in the dual of the moment polytope. Across six environments with exogenous and endogenous uncertainty, SRL matches or surpasses the performance of unstructured RL and imitation learning on static tasks and improves over these baselines by up to 92% on dynamic problems, with improved stability and convergence speed.
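
For orientation, the Fenchel-Young loss named in the abstract can be written in its standard general form as below. The symbols $\theta$, $\widehat{a}$, and $\mathcal{A}(s)$ follow the figure captions of this summary; the Gaussian perturbation $Z$ and the scale $\varepsilon$ are assumptions made for illustration, and the paper's exact regularizer may differ.

```latex
% Fenchel-Young loss generated by a convex regularizer \Omega with conjugate \Omega^*
\mathcal{L}_\Omega(\theta;\widehat{a})
  = \Omega^*(\theta) + \Omega(\widehat{a}) - \langle \theta, \widehat{a} \rangle \;\ge\; 0 .

% Perturbation-based choice (assumption: Z \sim \mathcal{N}(0, I), scale \varepsilon > 0)
\Omega_\varepsilon^*(\theta)
  = \mathbb{E}_Z\Big[\max_{a \in \mathcal{A}(s)} \langle \theta + \varepsilon Z, a \rangle\Big],
\qquad
\nabla_\theta \mathcal{L}_{\Omega_\varepsilon}(\theta;\widehat{a})
  = \mathbb{E}_Z\Big[\operatorname*{arg\,max}_{a \in \mathcal{A}(s)} \langle \theta + \varepsilon Z, a \rangle\Big] - \widehat{a} .
```

The gradient needs only calls to the combinatorial solver on perturbed scores, which is what makes end-to-end training through a CO-layer possible without differentiating the solver itself.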

Paper Structure

This paper contains 82 sections, 1 theorem, 31 equations, 8 figures, 11 tables, 1 algorithm.

Key Result

Proposition 2

The actor update in the static version of the SRL algorithm can be written as [equation omitted in this summary], where $\delta_a$ is the Dirac distribution on $a$, $\Omega_\Delta$ is the negentropy, and $\Omega_{\varepsilon, \Delta}$ is the conjugate of the sparse perturbation, both detailed in the proofs appendix. Since $\hat{q}_{m}^{(t+\frac{1}{2})}$ is sparse by design, we discuss the definition of gradient ...

Figures (8)

  • Figure 1: Overview of the Structured Reinforcement Learning algorithm.
  • Figure 2: Left: action polytope $\mathcal{C}(s)= \mathop{\mathrm{conv}}\nolimits(\mathcal{A}(s))$. Middle: normal cone for which $f(\theta,s)=a_1$, right: normal cone $\mathcal{F}_{a_1}$ in dual space.
  • Figure 3: Schematic representation of SRL update step: unperturbed action $a$ (left), perturbed actions and target action $\widehat{a}$ (middle), Fenchel-Young loss $\mathcal{L}_\Omega(\theta;\widehat{a})$ (right).
  • Figure 4: DVSP results. Left: final train and test-performance compared to greedy ($\Delta^{\text{greedy}}$); right: validation performance during training; averaged over 10 random model initializations.
  • Figure 5: DAP and GSPP results. Left: final train and test-performance compared to greedy ($\Delta^{\text{greedy}}$); right: validation performance during training; averaged over 10 random model initializations.
  • ...and 3 more figures
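
The update step sketched in Figure 3 (unperturbed action, perturbed actions, target action, Fenchel-Young loss) can be illustrated with a small toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the action set is "pick exactly k of d items", the perturbation is Gaussian, the target action is taken to be the perturbed candidate with the highest critic value, and the names `co_layer`, `srl_actor_step`, `critic`, and `true_value` are hypothetical.

```python
import numpy as np

def co_layer(theta, k):
    """Toy CO-layer: argmax of <theta, a> over binary vectors with exactly k ones."""
    a = np.zeros_like(theta)
    a[np.argsort(theta)[-k:]] = 1.0
    return a

def srl_actor_step(theta, k, critic, rng, n_samples=20, eps=1.0, lr=0.1):
    """One actor update on the score vector theta (hypothetical helper)."""
    # Perturb the scores and solve the CO-layer once per perturbation.
    noise = rng.standard_normal((n_samples, theta.size))
    candidates = np.array([co_layer(theta + eps * z, k) for z in noise])
    # Assumption: the target action is the candidate the critic rates highest.
    values = np.array([critic(a) for a in candidates])
    target = candidates[np.argmax(values)]
    # Monte Carlo gradient of the perturbed Fenchel-Young loss:
    # mean of perturbed solutions minus the target action.
    grad = candidates.mean(axis=0) - target
    return theta - lr * grad, target

rng = np.random.default_rng(0)
true_value = np.array([0.1, 0.9, 0.2, 0.8, 0.3])
critic = lambda a: float(true_value @ a)  # stand-in for a learned Q(s, a)

theta = np.zeros(5)
for _ in range(300):
    theta, _ = srl_actor_step(theta, k=2, critic=critic, rng=rng)
print(co_layer(theta, k=2))
```

In this toy setting the scores drift toward the items the critic values most, so the unperturbed CO-layer output ends up selecting them. In a full actor-critic setup, `theta` would be produced by a neural network from the state and the critic would be learned from TD targets rather than fixed.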

Theorems & Definitions (2)

  • Proposition 2
  • Proof of Proposition 2