From Sequential to Parallel: Reformulating Dynamic Programming as GPU Kernels for Large-Scale Stochastic Combinatorial Optimization

Jingyi Zhao; Linxin Yang; Haohua Zhang; Tian Ding

TL;DR

The paper tackles the intractable second-stage dynamic programs in scenario-based stochastic programming by introducing a GPU-based, scenario-batched dynamic-programming framework. It reframes DP recursions as batched min-plus computations and implements hardware-aware kernels that expose parallelism across scenarios, DP layers, and action choices, enabling Bellman updates over more than $10^6$ realizations in a single pass. Two instantiations demonstrate the approach: a giant-tour split DP for capacitated vehicle routing with stochastic demand (CVRPSD) and a forward inventory reinsertion DP for dynamic stochastic IRP, each leveraging 2D/3D GPU parallelism to achieve near-linear scaling and large speedups over CPU baselines. Experimental results show that larger scenario sets reduce estimation bias and improve decision quality, and that GPU acceleration translates directly into better first-stage decisions within fixed time budgets. This work establishes a practical path to large-scale, realistic stochastic discrete optimization by delivering full-fidelity second-stage evaluation at scales previously considered infeasible.
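
As a rough illustration of what "batched min-plus computation" means here, the recursion can be written in the notation of the Figure 1 caption. The base case $J^\omega(0)=0$ and the $+\infty$ convention for infeasible transitions are assumptions made for this sketch, not statements taken from the paper.

$$
J^\omega(i) \;=\; \min_{0 \le p < i} \bigl\{\, J^\omega(p) + A^\omega(p,i) \,\bigr\}, \qquad J^\omega(0) = 0 .
$$

Defining the min-plus product of a row vector $x$ and a matrix $M$ as $(x \otimes M)_i = \min_p \,(x_p + M_{pi})$, and setting $A^\omega(p,i) = +\infty$ whenever $p \ge i$ or the transition is infeasible, the same update reads

$$
J^\omega(i) \;=\; \bigl( J^\omega \otimes A^\omega \bigr)_i .
$$

Because $A^\omega$ is strictly upper triangular under this convention, entry $i$ depends only on entries $p<i$, so a forward sweep over $i$ evaluates it. The scenarios $\omega$ do not interact, which is what allows them to be stacked along a batch dimension.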

Abstract

A major bottleneck in scenario-based Sample Average Approximation (SAA) for stochastic programming (SP) is the cost of solving an exact second-stage problem for every scenario, especially when each scenario contains an NP-hard combinatorial structure. This has led much of the SP literature to restrict the second stage to linear or simplified models. We develop a GPU-based framework that makes full-fidelity integer second-stage models tractable at scale. The key innovation is a set of hardware-aware, scenario-batched GPU kernels that expose parallelism across scenarios, dynamic-programming (DP) layers, and route or action options, enabling Bellman updates to be executed in a single pass over more than 1,000,000 realizations. We evaluate the approach in two representative SP settings: a vectorized split operator for stochastic vehicle routing and a DP for inventory reinsertion. The implementation scales nearly linearly in the number of scenarios and achieves speedups ranging from one or two to four or five orders of magnitude, allowing far larger scenario sets and reliably stronger first-stage decisions. The computational leverage directly improves decision quality: much larger scenario sets and many more first-stage candidates can be evaluated within fixed time budgets, consistently yielding stronger SAA solutions. Our results show that full-fidelity integer second-stage models are tractable at scales previously considered impossible, providing a practical path to large-scale, realistic stochastic discrete optimization.
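
To make the split operator concrete, the following is a minimal sequential reference in plain Python of a scenario-batched split DP on a giant tour. It is a sketch under stated assumptions, not the paper's kernel: the function name `split_dp_batched`, the cost model (each route runs depot, consecutive tour customers, depot) and the hard capacity check per scenario are all hypothetical choices made for illustration. The loops over scenarios and predecessors mark the two axes that a GPU implementation would spread across threads.

```python
import math

INF = math.inf


def split_dp_batched(dist, tour, demands, capacity):
    """Return the split-DP cost of `tour` for every demand scenario.

    dist     : square distance matrix, node 0 is the depot (assumed layout)
    tour     : customers in giant-tour order
    demands  : demands[w][node] is the non-negative demand of `node` in scenario w
    capacity : vehicle capacity (assumed identical for all routes)
    """
    n = len(tour)
    num_scenarios = len(demands)
    # J[w][i]: cheapest way to serve the first i tour customers in scenario w.
    J = [[INF] * (n + 1) for _ in range(num_scenarios)]
    for w in range(num_scenarios):
        J[w][0] = 0.0

    for i in range(1, n + 1):              # DP layers: inherently sequential
        for w in range(num_scenarios):     # parallel axis 1: scenarios
            best = INF
            load = 0.0
            inner = 0.0                    # length of the path between route ends
            # Parallel axis 2: predecessors p < i. They are scanned backwards
            # here only so that load and route length can be accumulated.
            for p in range(i - 1, -1, -1):
                first = tour[p]            # first customer of the route p+1..i
                load += demands[w][first]
                if load > capacity:
                    break                  # all smaller p are infeasible as well
                if p < i - 1:
                    inner += dist[first][tour[p + 1]]
                route = dist[0][first] + inner + dist[tour[i - 1]][0]
                best = min(best, J[w][p] + route)   # min-plus reduction over p
            J[w][i] = best
    return [J[w][n] for w in range(num_scenarios)]


# Toy instance: depot at 0 and customers 1, 2, 3 on a line at positions 1, 2, 3.
positions = [0, 1, 2, 3]
dist = [[abs(a - b) for b in positions] for a in positions]
scenarios = [
    [0, 1, 1, 1],          # at most two customers fit in one vehicle
    [0, 2, 2, 2],          # every customer needs a dedicated vehicle
    [0, 0.5, 0.5, 0.5],    # a single vehicle can serve everyone
]
costs = split_dp_batched(dist, [1, 2, 3], scenarios, capacity=2)
print(costs)
```

The early `break` relies on demands being non-negative. A kernel that expands all predecessors at once would instead assign $+\infty$ to infeasible $(p, i)$ pairs and leave the pruning to the min-reduction.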

Paper Structure (39 sections, 1 theorem, 39 equations, 13 figures, 4 tables, 5 algorithms)

Introduction
Generic Dynamic Programming Framework
Preliminary Knowledge.
Transition-Based Formulation.
Min-Plus Matrix Formulation.
Instantiation A: Split DP on a Giant Tour in the Vehicle Routing Problem with Stochastic Demand.
Problem Motivation.
Forward DP Recursion and Matrix Form.
2D Parallelism on GPU.
Instantiation B: Forward Inventory Reinsertion DP in Dynamic Stochastic Inventory Routing Problems.
Problem Motivation.
Forward DP Recursion and Matrix Form.
3D Parallelism on GPU.
Experiments
Scaling the Scenario Size in Stochastic Programming.
...and 24 more sections

Key Result

Proposition 1

Let $\{ \tilde{\xi}^1, \dots, \tilde{\xi}^m \}$ be i.i.d. samples of $\tilde{\xi}$. Denote by the true and sample average problems, respectively, with optimal solutions $x^*$ and $x_m^*$. Then:

Figures (13)

Figure 1: 2D DP parallelism on GPU (scenarios $\times$ predecessors). Each row corresponds to a destination state $i$, each block within a row represents a scenario $\omega$, and colored bars indicate feasible predecessors $p<i$. For each $(i,\omega)$ pair, all predecessors are expanded in parallel to form the set $\{J^\omega(p)+A^\omega(p,i):\,p<i\}$, followed by a column-wise min-reduction over $p$ that yields $J^\omega(i)$.
Figure 2: 3D DP parallelism on GPU (scenarios $\times$ transitions $\times$ route options). Each layer corresponds to a stage $t$, with nodes representing end-of-day inventory levels $I$. Colored edges denote feasible transitions $I \to J$ under scenario-specific demands $d^{t,\omega}$. For each tuple $(\omega, I{\to}J, r)$, threads evaluate the cost contribution $J_i^t(I)+A_i^{t,\omega}(I,J;r)$, combining routing overhead with holding and stockout penalties. A two-level reduction (first across route options $r$, then across predecessor states $I$) yields $J_i^{t+1}(J)$ per scenario. The figure highlights how GPU parallelism spans scenarios, transitions, and route options, turning the DP recursion into a fully batched min-plus update.
Figure 3: Empirical behavior of SAA estimators in DSIRP. Top: bias under different demand distributions. Bottom: convergence with increasing scenario size.
Figure 5: Out-of-sample performance of first-stage solutions obtained with varying observed scenario settings. Larger evaluation sets yield more robust and lower-cost solutions.
Figure 6: Quality of the best solution obtained at each time under a fixed time budget. GPU consistently achieves better decisions due to faster evaluation and thus larger effective search effort.
...and 8 more figures
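
The two-level reduction described in the Figure 2 caption can be sketched as a sequential reference in plain Python for a single customer. Everything specific in it is an assumption made for illustration rather than the paper's model: the names `transition_cost` and `reinsertion_dp`, route options given as `(route_cost, delivered_quantity)` pairs, a linear holding cost on end-of-day inventory, and a linear stockout penalty. The loops over scenarios, transitions $I \to J$ and route options $r$ correspond to the three axes that the figure maps to GPU threads.

```python
import math

INF = math.inf


def transition_cost(I, J, d, option, max_level, hold, penalty):
    """A(I, J; r): cost of moving from level I to level J under demand d.

    `option` is a hypothetical (route_cost, delivered_quantity) pair.
    Returns INF when option r cannot produce the transition I -> J.
    """
    route_cost, qty = option
    if I + qty > max_level:                 # delivery would overflow storage
        return INF
    if J != max(I + qty - d, 0):            # this option does not land on J
        return INF
    shortfall = max(d - I - qty, 0)
    return route_cost + hold * J + penalty * shortfall


def reinsertion_dp(demands, options, max_level, start, hold, penalty):
    """Return the best total cost per scenario; demands[w][t] is the demand at stage t."""
    levels = range(max_level + 1)
    results = []
    for w in range(len(demands)):           # axis 1: scenarios
        value = [INF] * (max_level + 1)     # value[I] plays the role of J^t(I)
        value[start] = 0.0
        for d in demands[w]:                # stages t: inherently sequential
            nxt = []
            for J in levels:                # destination state
                best = INF
                for I in levels:            # axis 2: transitions I -> J
                    if value[I] == INF:
                        continue
                    # Level-1 reduction over route options r (axis 3).
                    a = min(
                        transition_cost(I, J, d, r, max_level, hold, penalty)
                        for r in options
                    )
                    # Level-2 reduction over predecessor states I.
                    best = min(best, value[I] + a)
                nxt.append(best)
            value = nxt
        results.append(min(value))
    return results


# Toy instance: skip the visit for free, or pay 3 to deliver 2 units.
options = [(0, 0), (3, 2)]
demands = [[1, 1], [0, 2], [1, 0]]
values = reinsertion_dp(demands, options, max_level=2, start=1, hold=1, penalty=10)
print(values)
```

In this sketch the end level $J$ is fully determined by $(I, r, d)$, so most $(I, J, r)$ triples evaluate to $+\infty$. That is wasteful on a CPU but keeps the access pattern regular, which is the shape a batched min-plus update needs.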

Theorems & Definitions (1)

Proposition 1
