Speeding up Policy Simulation in Supply Chain RL

Vivek Farias, Joren Gijsbrechts, Aryan Khojandi, Tianyi Peng, Andrew Zheng

TL;DR

This work tackles the policy-evaluation bottleneck in policy optimization for large-scale supply chain RL by introducing Picard Iteration, an iterative, GPU-friendly scheme that partitions the horizon into multiple tasks and uses a cached action sequence to enable batched policy evaluation. The authors prove convergence in a small number of iterations for a broad class of supply chain optimization (SCO) problems and demonstrate practical speedups of up to ~400x on a single GPU for Fulfillment Optimization, which in turn accelerates end-to-end RL pipelines. Beyond SCO, they report promising results in OpenAI Gym MuJoCo environments, suggesting the approach may generalize to other RL domains. The work also compares Picard Iteration to Time Warp both empirically and theoretically, showing order-of-magnitude advantages for the proposed approach. Overall, Picard Iteration offers a scalable, provably efficient path to faster policy evaluation and optimization in large-scale RL problems with long horizons.

Abstract

Simulating a single trajectory of a dynamical system under some state-dependent policy is a core bottleneck in policy optimization (PO) algorithms. The many inherently serial policy evaluations that must be performed in a single simulation constitute the bulk of this bottleneck. In applying PO to supply chain optimization (SCO) problems, simulating a single sample path corresponding to one month of a supply chain can take several hours. We present an iterative algorithm to accelerate policy simulation, dubbed Picard Iteration. This scheme carefully assigns policy evaluation tasks to independent processes. Within an iteration, any given process evaluates the policy only on its assigned tasks while assuming a certain "cached" evaluation for other tasks; the cache is updated at the end of the iteration. Implemented on GPUs, this scheme admits batched evaluation of the policy across a single trajectory. We prove that the structure afforded by many SCO problems allows convergence in a small number of iterations independent of the horizon. We demonstrate practical speedups of 400x on large-scale SCO problems even with a single GPU, and also demonstrate practical efficacy in other RL environments.
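The scheme described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy inventory dynamics in `transition`, the stand-in `policy`, the contiguous assignment of tasks to processes, the all-zeros initial cache, and the function names themselves. The loop over processes is written serially; on a GPU the policy evaluations of the different processes would be batched.

```python
def transition(state, action):
    """Cheap toy dynamics (assumed): ship `action` units; inventory stays >= 0."""
    return max(state - action, 0)


def policy(state, demand):
    """Stand-in for an expensive policy (assumed): fulfil what inventory allows."""
    return min(state, demand)


def simulate_sequential(s0, demands):
    """Reference simulation: one policy call per time step, strictly in order."""
    state, actions = s0, []
    for d in demands:
        a = policy(state, d)
        actions.append(a)
        state = transition(state, a)
    return actions


def picard_iteration(s0, demands, cache, num_procs):
    """One iteration: each process re-evaluates the policy only on its own tasks
    and assumes the cached actions for all earlier tasks owned by other processes."""
    T = len(demands)
    chunk = -(-T // num_procs) if T else 0  # ceiling division
    new_cache = list(cache)
    for m in range(num_procs):  # independent processes; parallel in practice
        start = min(m * chunk, T)
        stop = min((m + 1) * chunk, T)
        state = s0
        for t in range(start):  # fast-forward with cached actions (cheap)
            state = transition(state, cache[t])
        for t in range(start, stop):  # assigned tasks: call the policy
            a = policy(state, demands[t])
            new_cache[t] = a
            state = transition(state, a)
    return new_cache  # the cache is updated only at the end of the iteration


def simulate_picard(s0, demands, num_procs):
    """Iterate until the cache stops changing; returns (actions, cache updates)."""
    T = len(demands)
    cache = [0] * T  # arbitrary initial guess (assumed)
    for k in range(T):
        new_cache = picard_iteration(s0, demands, cache, num_procs)
        if new_cache == cache:
            return cache, k
        cache = new_cache
    return cache, T


demands = [2, 2, 2, 2, 2, 2]
actions, updates = simulate_picard(5, demands, num_procs=3)
print(actions, updates)
```

In this toy instance each process makes only two serial policy calls per iteration instead of six, and the cache matches the sequential trajectory after a few updates. How many iterations are needed in general depends on the problem structure, which is what the paper's analysis addresses.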

Paper Structure (25 sections, 8 theorems, 29 equations, 5 figures, 4 tables)

This paper contains 25 sections, 8 theorems, 29 equations, 5 figures, and 4 tables.

Key Result

Proposition 2.1

The Picard iteration converges in at most $T$ iterations and returns $\{a^{\rm seq}_t\}$.
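One way to see why $T$ iterations suffice is a short induction over time steps. The sketch below is an illustration, not the paper's proof. It assumes notation not given on this page: $a^{(k)}_t$ is the cached action for time $t$ after iteration $k$, $s_1$ is the fixed initial state, time runs over $t \in [T]$, and the transitions $f_t$ are deterministic once any exogenous randomness is fixed.

```latex
\textbf{Claim.} After iteration $k$, $a^{(k)}_t = a^{\rm seq}_t$ for all $t \le k$.

\textbf{Base case} ($k = 1$). The initial state $s_1$ is given, so the process
responsible for $t = 1$ computes
\[
  a^{(1)}_1 = \pi(s_1) = a^{\rm seq}_1 .
\]

\textbf{Inductive step.} Suppose $a^{(k-1)}_t = a^{\rm seq}_t$ for all $t \le k-1$.
In iteration $k$, the process responsible for a time $t \le k$ reaches its state by
applying actions for the earlier times that are either cached, hence correct by
hypothesis, or re-evaluated from a correct state, hence correct again. Therefore
\[
  s_t = f_{t-1}\bigl(s_{t-1}, a^{\rm seq}_{t-1}\bigr) = s^{\rm seq}_t,
  \qquad
  a^{(k)}_t = \pi\bigl(s^{\rm seq}_t\bigr) = a^{\rm seq}_t
  \quad \text{for all } t \le k .
\]

\textbf{Conclusion.} Taking $k = T$ gives $a^{(T)}_t = a^{\rm seq}_t$ for every
$t \in [T]$, so the iteration returns $\{a^{\rm seq}_t\}$ after at most $T$ iterations.
```

This is only the worst-case bound: it guarantees one additional correct action per iteration, which by itself gives no speedup over sequential simulation. The practical gains come from problem structure that lets many actions become correct in the same iteration.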

Figures (5)

  • Figure 1: (a) Time Warp is a widely used method for simulating discrete event trajectories. It employs a message-passing algorithm in which each processor (blue box) maintains a local time and processes events in parallel, potentially triggering new events. If a processor receives an event with a timestamp earlier than its local time, it must roll back, which can cause cascading rollbacks. This overhead makes Time Warp inefficient for general MDP trajectory simulation. (b) In contrast, we observe that in many RL problems the transition $f_{t}$ is computationally cheap while the policy $\pi$ is expensive. Picard iteration builds on the following intuition: once a trajectory of states (or actions) is initialized, the policy $\pi$ can be evaluated at all time steps in parallel. The trajectory is then updated efficiently via the lightweight transitions, and the process is iterated until convergence. Compared to Time Warp, Picard iteration (1) is GPU-friendly, (2) enables new analyses for provable speedups, (3) achieves significant acceleration in practical SCO settings, and (4) demonstrates speedup potential for general RL.
  • Figure 2: Picard runtime on problem instances with uniformly distributed demand, as a function of the number of processes (batch size $M$). The $y$-axis shows computation time normalized to the $M=1$ case (i.e., the speedup). For $M=10^4$, we achieve a $441\times$ speedup relative to the sequential algorithm.
  • Figure 3: Convergence of the Picard iteration for Gym MuJoCo environments, measured as the relative RMSE between the Picard trajectory and the sequentially simulated correct trajectory (normalized by the RMSE of the draft trajectory $\{s_{t}^{0}\}_{t \in [T]}$). The solid line shows the median relative RMSE at each iteration over 30 seeds; error bars show the 20th and 80th percentiles. The median relative RMSE falls to $\leq 0.1\%$ in under fifteen iterations for all environments, compared with a horizon of $T=200$; five of the eight environments converge within five iterations.
  • Figure 4: Speedup of Picard Iteration relative to sequential simulation, as a function of the batch size and the number of retailers $N$.
  • Figure 5: Speedup of Picard Iteration relative to sequential simulation, as a function of neural network size.
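The convergence metric in the Figure 3 caption can be sketched as follows. This is one plausible reading of the caption, stated as an assumption: the normalizer is taken to be the RMSE between the draft trajectory and the sequential trajectory, and states are treated as scalars (vector-valued MuJoCo states would first need to be flattened). The function names `rmse` and `relative_rmse` are hypothetical.

```python
import math


def rmse(xs, ys):
    """Root-mean-square error between two equal-length trajectories."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("trajectories must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def relative_rmse(picard, sequential, draft):
    """Error of the current Picard trajectory relative to the initial draft's
    error (assumed normalization; see the lead-in)."""
    base = rmse(draft, sequential)
    if base == 0.0:
        raise ValueError("draft already equals the sequential trajectory")
    return rmse(picard, sequential) / base


sequential = [0.0, 0.0, 0.0, 0.0]  # correct trajectory
draft = [2.0, 2.0, 2.0, 2.0]       # initial guess, RMSE 2 from the truth
picard = [1.0, 1.0, 1.0, 1.0]      # intermediate iterate, RMSE 1 from the truth
print(relative_rmse(picard, sequential, draft))  # 0.5
```

Under this reading, a value of 1 means no improvement over the draft and a value of 0 means the Picard trajectory coincides with the sequential one.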

Theorems & Definitions (9)

  • Proposition 2.1
  • Theorem 3.1: Informal
  • Lemma 3.2
  • Theorem 3.2
  • Lemma 1.1
  • Theorem 2.1
  • Lemma 2.2
  • Theorem 3.1
  • proof