Table of Contents
Fetching ...

Small Decision Trees for MDPs with Deductive Synthesis

Roman Andriushchenko, Milan Češka, Sebastian Junges, Filip Macák

TL;DR

The paper addresses the challenge of representing optimal or near-optimal policies for Markov decision processes (MDPs) with small, interpretable decision trees (DTs). It introduces a two‑pronged approach: (i) an SMT-based encoding to map a given policy into a small DT, and (ii) dtPaynt, an abstraction‑refinement loop that searches for policies optimal within the class of DTs up to a fixed depth. The method leverages unsatisfiable cores and a harmonization mechanism to iteratively refine the search, overapproximate policy families, and split the space into tractable subproblems. Empirically, dtPaynt outperforms the state-of-the-art MILP-based synthesis (OMDT) on many benchmarks, producing DTs up to 20× smaller and achieving near‑optimal values on models with up to 10k states and 19 variables, and it scales to much larger MCs when small DTs suffice. This work advances interpretable policy synthesis for large, real-world MDPs and offers a practical reduction pathway to compact, verifiable decision trees for complex policies.

Abstract

Markov decision processes (MDPs) describe sequential decision-making processes; MDP policies return for every state in that process an advised action. Classical algorithms can efficiently compute policies that are optimal with respect to, e.g., reachability probabilities. However, these policies are then given in a tabular format. A longstanding challenge is to represent optimal or almost-optimal policies concisely, e.g., as decision trees. This paper makes two contributions towards this challenge: first, an SMT-based approach to encode a given (optimal) policy as a small decision tree, and second, an abstraction-refinement loop that searches for policies that are optimal within the set of policies that can be represented with a small tree. Technically, the latter combines the SMT encoding with verification approaches for families of Markov chains. The empirical evaluation demonstrates the feasibility of these approaches and shows how they can outperform the state-of-the-art on various benchmarks, yielding up to 20 times smaller trees representing (almost) optimal policies for models with up to 10k states and 19 variables.

Small Decision Trees for MDPs with Deductive Synthesis

TL;DR

The paper addresses the challenge of representing optimal or near-optimal policies for Markov decision processes (MDPs) with small, interpretable decision trees (DTs). It introduces a two‑pronged approach: (i) an SMT-based encoding to map a given policy into a small DT, and (ii) dtPaynt, an abstraction‑refinement loop that searches for policies optimal within the class of DTs up to a fixed depth. The method leverages unsatisfiable cores and a harmonization mechanism to iteratively refine the search, overapproximate policy families, and split the space into tractable subproblems. Empirically, dtPaynt outperforms the state-of-the-art MILP-based synthesis (OMDT) on many benchmarks, producing DTs up to 20× smaller and achieving near‑optimal values on models with up to 10k states and 19 variables, and it scales to much larger MCs when small DTs suffice. This work advances interpretable policy synthesis for large, real-world MDPs and offers a practical reduction pathway to compact, verifiable decision trees for complex policies.

Abstract

Markov decision processes (MDPs) describe sequential decision-making processes; MDP policies return for every state in that process an advised action. Classical algorithms can efficiently compute policies that are optimal with respect to, e.g., reachability probabilities. However, these policies are then given in a tabular format. A longstanding challenge is to represent optimal or almost-optimal policies concisely, e.g., as decision trees. This paper makes two contributions towards this challenge: first, an SMT-based approach to encode a given (optimal) policy as a small decision tree, and second, an abstraction-refinement loop that searches for policies that are optimal within the set of policies that can be represented with a small tree. Technically, the latter combines the SMT encoding with verification approaches for families of Markov chains. The empirical evaluation demonstrates the feasibility of these approaches and shows how they can outperform the state-of-the-art on various benchmarks, yielding up to 20 times smaller trees representing (almost) optimal policies for models with up to 10k states and 19 variables.
Paper Structure (76 sections, 10 theorems, 8 equations, 7 figures, 2 tables, 1 algorithm)

This paper contains 76 sections, 10 theorems, 8 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

corollary 1

(Proof in App. appendix:proof:random) Given an MDP $M$, it holds that $V(M)~=~V(M')$.

Figures (7)

  • Figure 1: (a) A simple slippery maze. The goal is to lead the robot placed in the lower left corner (the red dot) towards the exit cell while minimizing the number of steps. Red arrowheads illustrate the optimal policy that achieves a value of 12.8 (expected steps). Blue arrowheads illustrate a sub-optimal policy that achieves a value of 13.3. (b) The smallest DT implementing the optimal policy has depth 4. (c) The smallest DT implementing the sub-optimal policy has only depth 2.
  • Figure 2: MDP construction for the reduction from X3C problem.
  • Figure 3: Comparison on the bounded-depth synthesis problem for Q1. The scatter plot shows the normalized values of the best $k$-DTs found in the 20-minute timeout. A point $(x,y)$ shows the value of the best $k$-DT found by dtPaynt (the $x$-value) and OMDT (the $y$-value) for a specific model and a specific depth $k$. Points below the diagonal shows the synthesis problems where dtPaynt outperforms OMDT. The plot on the left compares performance on smaller models (less than 3k choices), and the plot on the right compares performance on larger models.
  • Figure 4: The plot shows the trade-off between the value and size of the DTs found by dtPaynt and dtControl. The left part contains smaller models, and the right part contains larger ones. The upper part shows the normalized values of the synthesized DTs (the same normalization as in Q1). dtControl maps an optimal policy with the value 1 (not shown). The lower part shows the number of inner nodes using a logarithmic scale. We report the performance of dtPaynt for depths 1 to 8 (if a DT with a normalized value above 0.95 is obtained, we exclude the subsequent depths on the given model to simplify the plot). Numbers in brackets above the dtControl bars denote the depth of the DT.
  • Figure 5: (a) The Exact Cover by 3-sets (X3C) problem formulation. (b) MDP construction for the reduction from X3C problem formulation. Note that the values of state variables $T_i$ were chosen arbitrarily and serve only as an example. This figure is the same as Fig. \ref{['fig:x3c-to-mdp']}. (c) DT corresponding to some exact cover $T'$ with a depth $|U|/3+2$.
  • ...and 2 more figures

Theorems & Definitions (26)

  • definition 1: MDP
  • corollary 1
  • remark 1: Interpretability of random action
  • definition 2: Decision tree
  • definition 3: Corresponding states
  • definition 4: Induced policy
  • remark 2: Fallback action interpretability
  • definition 5: Tree template
  • definition 6: Instantitation
  • definition 7: Parameterization
  • ...and 16 more