Optimal Decision Tree Policies for Markov Decision Processes

Daniël Vos, Sicco Verwer

TL;DR

Optimal MDP Decision Trees (OMDT) addresses interpretable reinforcement learning by directly optimizing size-limited decision-tree policies for MDPs via a mixed-integer linear programming (MILP) formulation. It uses the dual linear program of the MDP to couple tree structure with state-action frequencies, enabling exact optimization under a user-specified tree size. Empirically, depth-3 OMDTs achieve near-optimal performance across multiple environments and can outperform imitation-learning baselines like VIPER. In contrast, dtcontrol yields optimal policies, but its trees are substantially larger. The work establishes a foundation for integrating optimal decision-tree methods with RL and suggests directions toward scalable extensions such as factored MDPs and simulation-based learning.
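
For reference, the dual linear program of a discounted MDP can be written, in its standard textbook form, as an optimization over state-action frequencies $x_{s,a}$. The notation below is an assumption made for illustration ($p_0$ for the initial state distribution, $P$ for transition probabilities, $R$ for rewards) and may differ from the paper's own symbols.

```latex
\begin{aligned}
\max_{x \ge 0} \quad & \sum_{s}\sum_{a} x_{s,a} \sum_{s'} P(s' \mid s, a)\, R(s, a, s') \\
\text{s.t.} \quad & \sum_{a} x_{s,a} \;-\; \gamma \sum_{s'}\sum_{a'} P(s \mid s', a')\, x_{s',a'} \;=\; p_0(s) \qquad \forall s
\end{aligned}
```

A deterministic policy corresponds to a solution in which each visited state has nonzero frequency for a single action. A tree-constrained variant would additionally force $x_{s,a} = 0$ whenever the leaf reached by state $s$ does not select action $a$, which is where integer variables would enter.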

Abstract

Interpretability of reinforcement learning policies is essential for many real-world tasks, but learning such interpretable policies is a hard problem. Rule-based policies such as decision trees and rule lists are particularly difficult to optimize because they are non-differentiable. While existing techniques can learn verifiable decision tree policies, there is no guarantee that the learners generate a decision tree that performs optimally. In this work, we study the optimization of size-limited decision trees for Markov Decision Processes (MDPs) and propose OMDTs: Optimal MDP Decision Trees. Given a user-defined size limit and MDP formulation, OMDT directly maximizes the expected discounted return for the decision tree using Mixed-Integer Linear Programming. By training optimal decision tree policies for different MDPs, we empirically study the optimality gap for existing imitation learning techniques and find that they perform sub-optimally. We show that this is due to an inherent shortcoming of imitation learning, namely that complex policies cannot be represented using size-limited trees. In such cases, it is better to directly optimize the tree for expected return. While there is generally a trade-off between the performance and interpretability of machine learning models, we find that OMDTs limited to a depth of 3 often perform close to the optimal limit.
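
To make "directly optimizing the tree for expected return" concrete, here is a minimal sketch in Python. It is not the paper's MILP: it swaps in brute-force enumeration of every depth-1 tree, which is only feasible for tiny problems. The corridor MDP, its slip probability, and all function names are hypothetical and chosen purely for illustration.

```python
from itertools import product

# Hypothetical toy MDP (not from the paper): a 5-cell corridor, positions 0..4.
# Cell 4 is an absorbing goal; entering it pays reward 1.
# Each move is reversed ("slips") with probability SLIP.
N_STATES = 5
ACTIONS = (-1, +1)  # move left, move right
GAMMA = 0.9
SLIP = 0.2
GOAL = 4
FEATURES = [(s,) for s in range(N_STATES)]  # single feature: the position


def transitions(state, action):
    """Return [(probability, next_state, reward)] for the toy corridor."""
    if state == GOAL:
        return [(1.0, GOAL, 0.0)]
    outcomes = []
    for prob, move in ((1.0 - SLIP, action), (SLIP, -action)):
        nxt = min(max(state + move, 0), N_STATES - 1)
        outcomes.append((prob, nxt, 1.0 if nxt == GOAL else 0.0))
    return outcomes


def tree_policy(feature, threshold, left_action, right_action):
    """Depth-1 tree: 'feature <= threshold' picks left_action, else right_action."""
    return [left_action if FEATURES[s][feature] <= threshold else right_action
            for s in range(N_STATES)]


def discounted_return(policy, start=0, sweeps=500):
    """Iterative policy evaluation; returns the value of the start state."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        values = [sum(p * (r + GAMMA * values[n])
                      for p, n, r in transitions(s, policy[s]))
                  for s in range(N_STATES)]
    return values[start]


def best_depth1_tree():
    """Exhaustively score every depth-1 tree and keep the best one."""
    candidates = product(range(len(FEATURES[0])), range(N_STATES - 1),
                         ACTIONS, ACTIONS)
    return max(candidates, key=lambda c: discounted_return(tree_policy(*c)))
```

Calling `best_depth1_tree()` returns a `(feature, threshold, left_action, right_action)` tuple; on this toy corridor the selected tree moves right in every non-goal state. The search scores each candidate tree by its own return instead of by how well it imitates a teacher policy. Enumeration grows exponentially with tree depth, which is why an exact solver-based formulation is attractive for anything larger.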

Paper Structure

This paper contains 35 sections, 19 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Depth 2 OMDT on the stochastic Frozenlake 4x4 environment. OMDT proves that no better depth 2 decision tree policy exists (discounted return $0.37$ with $\gamma=0.99$).
  • Figure 2: Overview of OMDT's formulation. We maximize the discounted return in an MDP under the constraint that the policy is represented by a size-limited decision tree.
  • Figure 3: (top) Normalized return and bounds for OMDT trees of depth 3, optimal policies score 1 while uniform random policies score 0. (bottom) Log of tree sizes for OMDT (maximum depth 3) and dtcontrol. Dtcontrol always produces an optimal policy but the trees are orders of magnitude larger than OMDT.
  • Figure 4: Paths taken on 10,000 Frozenlake_12x12 runs. The agent starts at (0, 0) and attempts to reach the goal tile 'G' while avoiding holes. Actions are indicated by arrows and are somewhat stochastic, i.e. an action of 'up' will send the agent 'left', 'up', or 'right' (but never down) with equal probability. VIPER fails to produce a good policy because it spends capacity of its tree mimicking parts of the complex teacher policy that its simple student policy will never reach. OMDT achieves a greater success rate by directly optimizing a simple policy.
  • Figure 5: System administrator topologies.
  • ...and 2 more figures
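
The normalized return described in the Figure 3 caption (optimal policies score 1, uniform random policies score 0) is consistent with an affine rescaling such as the one below. This is an assumed form inferred from the caption, with $R(\pi)$ denoting the expected discounted return of policy $\pi$. The exact formula used in the paper is not shown in this summary.

```latex
\bar{R}(\pi) \;=\; \frac{R(\pi) - R(\pi_{\text{random}})}{R(\pi^{*}) - R(\pi_{\text{random}})}
```

Under this rescaling, a policy that does worse than uniform random play would receive a negative score.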