RGMDT: Return-Gap-Minimizing Decision Tree Extraction in Non-Euclidean Metric Space

Jingdi Chen, Hanhan Zhou, Yongsheng Mei, Carlee Joe-Wong, Gina Adam, Nathaniel D. Bastian, Tian Lan

TL;DR

The paper establishes an upper bound on the return gap between the oracle expert policy and an optimal decision tree policy, and proposes the Return-Gap-Minimization Decision Tree (RGMDT) algorithm, a simple design that integrates with reinforcement learning through a novel Regularized Information Maximization loss.

Abstract

Deep Reinforcement Learning (DRL) algorithms have achieved great success in solving many challenging tasks, but their black-box nature hinders interpretability and real-world applicability, making it difficult for human experts to interpret and understand DRL policies. Existing work on interpretable reinforcement learning has shown promise in extracting decision tree (DT) based policies from DRL policies, mostly in single-agent settings; prior attempts to introduce DT policies in multi-agent scenarios mainly rely on heuristic designs that provide no quantitative guarantees on the expected return. In this paper, we establish an upper bound on the return gap between the oracle expert policy and an optimal decision tree policy. This enables us to recast the DT extraction problem as a novel non-Euclidean clustering problem over the local observation and action-value space of each agent, with action values as cluster labels and the upper bound on the return gap as the clustering loss. Both the algorithm and the upper bound are extended to multi-agent decentralized DT extraction by an iteratively-grow-DT procedure guided by an action-value function conditioned on the current DTs of other agents. Further, we propose the Return-Gap-Minimization Decision Tree (RGMDT) algorithm, a surprisingly simple design that is integrated with reinforcement learning through a novel Regularized Information Maximization loss. Evaluations on tasks such as D4RL show that RGMDT significantly outperforms heuristic DT-based baselines and can achieve nearly optimal returns under given DT complexity constraints (e.g., a maximum number of DT nodes).
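To make the clustering view in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each observation comes with a vector of action values and that the non-Euclidean loss is the average cosine distance between those vectors and their cluster centers (the quantity mentioned in the Figure 5 caption). Spherical k-means is used here as a simple stand-in for the paper's Regularized Information Maximization procedure, and the function names `spherical_kmeans` and `average_cosine_distance` are hypothetical.

```python
import numpy as np


def _unit_rows(x):
    """Scale each row to unit length (assumes no all-zero rows)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def spherical_kmeans(q_values, n_clusters, n_iters=20, seed=0):
    """Cluster action-value vectors by cosine similarity.

    Stand-in for the paper's clustering step (an assumption of this sketch).
    Returns one integer cluster label per row of q_values.
    """
    rng = np.random.default_rng(seed)
    q_unit = _unit_rows(np.asarray(q_values, dtype=float))
    start = rng.choice(len(q_unit), size=n_clusters, replace=False)
    centers = q_unit[start]  # fancy indexing returns a copy
    for _ in range(n_iters):
        # Assign each vector to the center with the highest cosine similarity.
        labels = np.argmax(q_unit @ centers.T, axis=1)
        for k in range(n_clusters):
            members = q_unit[labels == k]
            if len(members) == 0:
                continue  # keep the previous center for an empty cluster
            mean = members.mean(axis=0)
            norm = np.linalg.norm(mean)
            if norm > 0:
                centers[k] = mean / norm
    return np.argmax(q_unit @ centers.T, axis=1)


def average_cosine_distance(q_values, labels):
    """Mean of (1 - cosine similarity) between each vector and its cluster center."""
    q_unit = _unit_rows(np.asarray(q_values, dtype=float))
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = q_unit[labels == k]
        mean = members.mean(axis=0)
        norm = np.linalg.norm(mean)
        if norm == 0:
            total += len(members)  # degenerate center: similarity treated as 0
            continue
        total += np.sum(1.0 - members @ (mean / norm))
    return total / len(q_unit)


# Toy action-value vectors: three favor action 0, three favor action 1.
toy_q = np.array([
    [1.0, 0.1], [2.0, 0.1], [1.5, 0.2],
    [0.1, 1.0], [0.2, 2.0], [0.1, 1.5],
])
toy_labels = spherical_kmeans(toy_q, n_clusters=2)
print(average_cosine_distance(toy_q, toy_labels))
```

In this sketch, the cluster labels would play the role of leaf assignments: a decision tree over observations could be fit to predict them, and a lower average cosine distance would correspond to a tighter value of the bound, under the assumptions stated above.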

Paper Structure

This paper contains 37 sections, 4 theorems, 39 equations, 10 figures, 3 tables, 2 algorithms.

Key Result

Lemma 4.1

(Policy Change Lemma.) For any policies $\pi^*$ and DT policy $\mathcal{T}^{L}$ with $L$ leaf nodes, the optimal expected average return gap is bounded by: where $d^{\mathcal{T}^{L}}_{\mu}(\mathbf{o})$ is the $\gamma$-discounted visitation probability under decision tree $\mathcal{T}^{L}$ and initial observation distribution $\mu$, and $\sum_{\mathbf{o}\sim l}$ is a sum over all observations corr

Figures (10)

  • Figure 1: Evaluation on Maze tasks. (a)-(c): RGMDT (purple bar) completes the tasks in fewer steps than all the baselines. (d)-(f): RGMDT (blue line) achieves a higher mean episode reward than all the baselines in all scenarios with varying complexities, which illustrates its ability to minimize the return gap in hard environments.
  • Figure 2: The normalized mean episode reward increases as the total number of leaf nodes increases (Hard Maze).
  • Figure 3: Comparisons on the $n$-agent tasks: (a)-(c) 2 agents, (d)-(f) 3 agents. RGMDT with limited leaf nodes learns these tasks much faster than the baselines and has better final performance, even when most of the baselines fail on the task with $2$ and $4$ leaf nodes.
  • Figure 4: The reward of RGMDT (starred purple bar) outperforms all baselines and increases with the number of leaf nodes, achieving performance comparable to the expert RL (no hatch style bar).
  • Figure 5: The return gap (left) is bounded by the average cosine distance (right) and diminishes as that distance decreases with a larger maximum number of leaf nodes.
  • ...and 5 more figures

Theorems & Definitions (8)

  • Lemma 4.1
  • Theorem 4.2
  • Lemma 4.3
  • Theorem 4.4
  • Remark 4.5
  • proof
  • proof
  • proof