Mitigating Information Loss in Tree-Based Reinforcement Learning via Direct Optimization

Sascha Marton; Tim Grams; Florian Vogt; Stefan Lüdtke; Christian Bartelt; Heiner Stuckenschmidt

TL;DR

SYMPOL introduces a direct, gradient-based method to learn hard, axis-aligned decision-tree policies within on-policy reinforcement learning, eliminating information loss from post-hoc distillation. By integrating GradTree with PPO, and employing a dynamic rollout buffer, gradient accumulation, and targeted weight decay, SYMPOL achieves competitive performance across control and grid-world tasks while yielding small, interpretable trees. The approach is framework-agnostic, supports continuous action spaces, and demonstrates that interpretable policies can rival or exceed some full-complexity models, with a case study showing how interpretability aids in detecting goal misgeneralization. Overall, SYMPOL provides a practical foundation for trustworthy, explainable RL in safety-critical and high-stakes domains, with avenues for extension to off-policy methods and more complex tree ensembles.

Abstract

Reinforcement learning (RL) has seen significant success across various domains, but its adoption is often limited by the black-box nature of neural network policies, which makes them difficult to interpret. In contrast, symbolic policies represent decision-making strategies in a compact and interpretable way. However, learning symbolic policies directly within on-policy methods remains challenging. In this paper, we introduce SYMPOL, a novel method for SYMbolic tree-based on-POLicy RL. SYMPOL integrates a tree-based model with a policy gradient method, enabling the agent to learn and adapt its actions while maintaining a high level of interpretability. We evaluate SYMPOL on a set of benchmark RL tasks and show that it outperforms alternative tree-based RL approaches in terms of performance and interpretability. Unlike existing methods, it enables gradient-based, end-to-end learning of interpretable, axis-aligned decision trees within standard on-policy RL algorithms. SYMPOL can therefore become the foundation for a new class of interpretable RL based on decision trees. Our implementation is available at: https://github.com/s-marton/sympol
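To make "hard, axis-aligned decision tree policy" concrete, the following is a minimal sketch of how such a policy could be evaluated at inference time. It is not the SYMPOL implementation and does not show the gradient-based training. The tree sizes follow the example in Figure 2 (depth 2, three state dimensions, two actions), while the array names, the split direction (go right when the feature exceeds the threshold), and all numeric values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical depth-2 tree: 3 internal nodes, 4 leaves, 3 state features, 2 actions.
# All values below are invented for illustration.
DEPTH = 2
feature_idx = np.array([0, 2, 1])        # feature tested at each internal node (breadth-first order)
thresholds = np.array([0.0, 0.5, -0.3])  # split threshold at each internal node
leaf_logits = np.array([                 # one row of action logits per leaf
    [2.0, -1.0],
    [-1.0, 2.0],
    [0.5, 0.1],
    [-2.0, 1.5],
])

def leaf_index(state):
    """Route a state through hard, axis-aligned splits and return its leaf index."""
    node = 0
    for _ in range(DEPTH):
        # Assumed convention: go right if the tested feature exceeds the threshold.
        go_right = state[feature_idx[node]] > thresholds[node]
        node = 2 * node + 1 + int(go_right)
    # Leaves follow the internal nodes in breadth-first numbering.
    return node - (2 ** DEPTH - 1)

def act(state):
    """Greedy action: argmax over the logits stored at the reached leaf."""
    return int(np.argmax(leaf_logits[leaf_index(state)]))
```

Because every split tests a single feature against a single threshold, the path taken for any state can be read off as a short chain of if-then rules, which is the property that makes such policies interpretable.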

Paper Structure (31 sections, 6 equations, 19 figures, 21 tables)

Introduction
Related Work
Preliminaries
SYMPOL: Symbolic On-Policy RL
Learning DTs with Policy Gradients
Addressing Training Stability
Evaluation
Experimental Settings
Results
SYMPOL does not exhibit information loss.
SYMPOL learns accurate DT policies.
DT policies offer a good inductive bias for categorical environments.
DT policies learned with SYMPOL are small and interpretable.
SYMPOL is efficient.
Ablation study.
...and 16 more sections

Figures (19)

Figure 1: Information Loss in Tree-Based Reinforcement Learning on Pendulum. Existing methods for symbolic, tree-based RL (Figures \ref{fig:sadt} and \ref{fig:d-sdt}) suffer from severe information loss when converting the differentiable policy used for training (e.g., the MLP for SA-DT) into the symbolic policy used for interpretation (i.e., the DT). Using SYMPOL (Figure \ref{fig:sympol}), we can directly optimize the symbolic policy with PPO and therefore have no information loss during the application.
Figure 2: Standard vs. Dense DT Representation. A comparison between the standard decision tree representation and its dense equivalent, illustrated using an example decision tree of depth 2, with a state space of dimensionality 3 and two possible actions.
Figure 3: Selected Training Curves. For three control environments, the training reward of the full-complexity policy (e.g., the MLP in the case of SA-DT) is shown as a solid line and the test reward of the interpretable policy as a dashed line. Additional, more detailed results are in Appendix \ref{A:runtime_curves}.
Figure 4: SYMPOL Policy for MC-C. The main rule encoded by this DT is that the car should accelerate to the left, if its velocity is negative, and to the right if it is positive. This essentially increases the speed of the car over time, making it possible to reach the goal at the top of the hill. The magnitude of acceleration is mainly determined by the current position, reducing the action cost.
Figure 5: Ablation Study. We report the mean normalized reward over all control environments (details in Table \ref{tab:ablation}).
...and 14 more figures
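
The main rule described in the Figure 4 caption can be written as a few lines of code, which illustrates how compact such a policy is. This is a hedged sketch rather than the learned tree: the caption gives no thresholds, so the function name `mcc_policy`, the position threshold of -0.5, the two force magnitudes, and the handling of zero velocity are all hypothetical.

```python
def mcc_policy(position, velocity):
    """Sketch of the Figure 4 rule for MC-C: push in the direction of motion.

    Returns a force in [-1, 1]; negative values accelerate to the left.
    """
    # Hypothetical magnitude rule: the caption only says that the magnitude
    # depends mainly on position, so this threshold and these values are invented.
    magnitude = 1.0 if position < -0.5 else 0.5
    # Main rule from the caption: left if velocity is negative, right if positive.
    # Zero velocity is not covered by the caption; it is mapped to "right" here.
    return -magnitude if velocity < 0 else magnitude
```

Pushing in the current direction of motion adds energy on every swing, which matches the caption's explanation that the car's speed increases over time until it can reach the goal.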
