Improved Monte Carlo Planning via Causal Disentanglement for Structurally-Decomposed Markov Decision Processes

Larkin Liu, Shiqi Liu, Yinruo Hua, Matej Jusup

TL;DR

The paper tackles the computational burden of planning in MDPs by exploiting causal structure to form Structurally Decomposed MDPs (SD-MDPs). By disentangling stochastic environmental transitions from deterministic reward-driven dynamics, the framework reduces the sequential optimization to a fractional knapsack-like problem with complexity $O(T\log T)$, independent of state-action dimensionality. It further integrates this abstraction with Monte Carlo Tree Search (MCTS) using Top$_k$ allocations and value clipping, and proves vanishing simple regret under budgeted simulation, supported by empirical results in logistics, energy, and finance. The approach enables scalable, near-optimal planning in high-dimensional settings and offers a principled pathway to combine causal reasoning with Monte Carlo planning in complex, resource-constrained domains.

Abstract

Markov Decision Processes (MDPs), as a general-purpose framework, often overlook the benefits of incorporating the causal structure of the transition and reward dynamics. For a subclass of resource allocation problems, we introduce the Structurally Decomposed MDP (SD-MDP), which leverages causal disentanglement to partition an MDP's temporal causal graph into independent components. By exploiting this disentanglement, SD-MDP enables dimensionality reduction and computational efficiency gains in optimal value function estimation. We reduce the sequential optimization problem to a fractional knapsack problem with log-linear complexity $O(T \log T)$, outperforming traditional stochastic programming methods that exhibit polynomial complexity with respect to the time horizon $T$. Additionally, SD-MDP's computational advantages are independent of state-action space size, making it viable for high-dimensional spaces. Furthermore, our approach integrates seamlessly with Monte Carlo Tree Search (MCTS), achieving higher expected rewards under constrained simulation budgets while providing a vanishing simple regret bound. Empirical results demonstrate superior policy performance over benchmarks across various logistics and finance domains.
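To make the complexity claim concrete, here is a minimal sketch of the greedy fractional-knapsack allocation that the reduction yields. The function name and the `densities`/`capacities`/`budget` inputs are illustrative assumptions, not the paper's exact formulation; sorting the $T$ periods dominates the cost, giving the stated $O(T \log T)$.

```python
def fractional_knapsack_plan(densities, capacities, budget):
    """Greedy fractional-knapsack allocation (illustrative sketch).

    densities[t]  -- hypothetical per-period reward density, e.g. a
                     simulated value of f(x_eta^t) at step t
    capacities[t] -- hypothetical per-period allocation cap
    budget        -- total resource available over the horizon

    Sorting the T periods dominates, so the cost is O(T log T),
    independent of the size of the state-action space.
    """
    T = len(densities)
    # Visit periods in decreasing order of reward density.
    order = sorted(range(T), key=lambda t: densities[t], reverse=True)
    alloc = [0.0] * T
    remaining = budget
    for t in order:
        if remaining <= 0:
            break
        take = min(capacities[t], remaining)  # fractional take allowed
        alloc[t] = take
        remaining -= take
    return alloc

# Toy usage: allocate 5 units of resource across a 4-step horizon.
print(fractional_knapsack_plan([3.0, 1.0, 4.0, 2.0], [2.0] * 4, 5.0))
# -> [2.0, 0.0, 2.0, 1.0]
```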

Paper Structure

This paper contains 58 sections, 6 theorems, 54 equations, 15 figures, 4 tables, and 2 algorithms.

Key Result

Lemma 2.1

Finite and Bounded Action Space for the SD-MDP: For the SD-MDP, at every step of the finite time horizon the optimal action lies in the union of two subspaces, that is, $\mathbf{a}^* \subset \{ \mathbf{a}^+ \} \cup \{ \mathbf{a}^- \} \subset \mathcal{A}(t) \subseteq \mathcal{A}$, for all time steps $t$.
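Lemma 2.1 collapses the per-step search to two extremal candidates. A minimal sketch of how a planner might exploit this, where `a_plus`, `a_minus`, and `q_value` are hypothetical stand-ins for the extremal actions $\mathbf{a}^+$, $\mathbf{a}^-$ and an action-value estimate:

```python
def best_extremal_action(state, t, a_plus, a_minus, q_value):
    """Per Lemma 2.1, the optimal action at each step lies in the union
    of two extremal subspaces, so it suffices to compare two candidates
    rather than search the full feasible set A(t).

    a_plus, a_minus  -- hypothetical extremal actions in A(t)
    q_value(s, t, a) -- hypothetical action-value estimate
    """
    return max((a_plus, a_minus), key=lambda a: q_value(state, t, a))
```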

Figures (15)

  • Figure 1: Causal Structure & Partitioning of the SD-MDP: The SD-MDP splits the transition dynamics into a stochastic component $\mathbf{x}_\eta^t$ and a deterministic component $\mathbf{x}_d^t$. The reward $\mu^t$ is driven by both partitions and by the action $\mathbf{a}^t$.
  • Figure 2: Norm-Capacity Dynamics: As the capacity of $\mathbf{x}_d$ shrinks under the norm-capacity constraints, the consumption of the resource can be transformed into a reward $\langle \phi f(\mathbf{x}_\eta^t), \, \mathbf{a}^t \rangle$. The blue shading represents shrinkage of the resource capacity; the orange shading represents the vector space of possible outcomes, and the magnitude of this vector (the red arrow) represents the reward.
  • Figure 3: We illustrate convergence to the optimal value function as a function of the number of MC iterations for the MENTS algorithm (Xiao et al., 2019). MENTS VC (MENTS with value clipping) yields stronger value convergence than vanilla MENTS.
  • Figure 4: We compare empirical results based on cost reduction or reward maximization. The leftmost boxplot presents an instance-dependent baseline for reference. MCTS value clipping within the SD-MDP framework improves expected cost/reward performance over vanilla MCTS for both the UCT and MENTS variants (see the value-clipping sketch after this list).
  • Figure 5: Atlantic Pacific Express (APX) liner route (Yao et al., 2012).
  • ...and 10 more figures
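Figures 3 and 4 refer to MCTS with value clipping. Below is a minimal sketch of one plausible clipped backup, assuming the SD-MDP analysis supplies bounds `v_lo`, `v_hi` on the optimal value; the node fields and the incremental-mean update are assumptions rather than the paper's exact procedure.

```python
def clipped_backup(path, ret, v_lo, v_hi):
    """Back up a simulated return along a root-to-leaf path, clipping it
    to analytical value bounds (e.g. bounds derived from the SD-MDP's
    fractional-knapsack relaxation) before updating node statistics.

    path       -- list of hypothetical node objects with .visits, .value
    ret        -- Monte Carlo return from one simulation
    v_lo, v_hi -- assumed lower/upper bounds on the optimal value
    """
    ret = max(v_lo, min(ret, v_hi))  # value clipping
    for node in path:
        node.visits += 1
        # Incremental mean update of the node's value estimate.
        node.value += (ret - node.value) / node.visits
```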

Theorems & Definitions (6)

  • Lemma 2.1
  • Lemma 2.2
  • Theorem 1
  • Theorem 2
  • Lemma A.1
  • Lemma A.2