Evaluating Interpretable Reinforcement Learning by Distilling Policies into Programs

Hector Kohler, Quentin Delfosse, Waris Radji, Riad Akrour, Philippe Preux

TL;DR

This work tackles RL policy interpretability by introducing a proxy-based framework that evaluates simulatability without user studies. It distills expert neural policies into compact, interpretable baselines via imitation learning (behavior cloning, DAgger, and $Q$-DAgger) and standardizes their representations by unfolding them into a common Python-like language, so that interpretability can be measured through policy inference time and policy size. Across classic control, MuJoCo, and OCAtari tasks, the study shows that interpretability does not universally trade off with performance and that no single policy class offers the best interpretability-performance trade-off across tasks; environment characteristics heavily influence outcomes. The authors provide large-scale baselines and a scalable methodology for comparing interpretable reinforcement learning policies, enabling systematic study of these trade-offs and of their implications for verification in practical deployments.
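As a rough illustration of the unfolding idea, the sketch below turns a small linear policy into source code in which every scalar operation is written out, then computes two simple proxies on it. It is not the authors' implementation: the weights, the helper names, the use of source length in bytes as the size proxy, and wall-clock timing as the inference-time proxy are all assumptions made for this example.

```python
import time

# Hypothetical weights for a 2-action linear policy over a 3-dimensional state.
WEIGHTS = [[0.5, -1.0, 0.25], [-0.3, 0.8, 0.1]]
BIASES = [0.0, 0.1]


def generic_linear_policy(state):
    """Folded form: the loops hide the individual multiply-adds."""
    scores = [sum(w * s for w, s in zip(row, state)) + b
              for row, b in zip(WEIGHTS, BIASES)]
    return max(range(len(scores)), key=scores.__getitem__)


def unfold_linear_policy(weights, biases):
    """Emit Python source in which every scalar operation is explicit."""
    lines = ["def unfolded_policy(state):"]
    for i, (row, b) in enumerate(zip(weights, biases)):
        terms = " + ".join(f"({w!r} * state[{j}])" for j, w in enumerate(row))
        lines.append(f"    y{i} = {terms} + {b!r}")
    # Argmax as explicit comparisons; ties go to the lowest action index.
    lines.append("    best, best_score = 0, y0")
    for i in range(1, len(weights)):
        lines.append(f"    if y{i} > best_score:")
        lines.append(f"        best, best_score = {i}, y{i}")
    lines.append("    return best")
    return "\n".join(lines)


def policy_size_bytes(source):
    """Size proxy (assumption): length of the unfolded source in bytes."""
    return len(source.encode("utf-8"))


def mean_inference_time(policy, states, repeats=1000):
    """Time proxy (assumption): mean wall-clock seconds per action."""
    start = time.perf_counter()
    for _ in range(repeats):
        for s in states:
            policy(s)
    return (time.perf_counter() - start) / (repeats * len(states))


# Compile the unfolded source into a callable.
SOURCE = unfold_linear_policy(WEIGHTS, BIASES)
_namespace = {}
exec(SOURCE, _namespace)
unfolded_policy = _namespace["unfolded_policy"]
```

Printing `SOURCE` shows the policy as a flat sequence of multiplications, additions, and comparisons. Putting every policy class into a form like this is what makes the two proxies comparable across classes.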

Abstract

Some applications of reinforcement learning, such as medicine, require policies to be "interpretable" by humans. User studies have shown that some policy classes might be more interpretable than others. However, human studies of policy interpretability are costly to conduct. Furthermore, there is no clear definition of policy interpretability, i.e., no clear metrics for interpretability, and thus claims depend on the chosen definition. We tackle the problem of empirically evaluating policy interpretability without humans. Despite this lack of a clear definition, researchers agree on the notion of "simulatability": policy interpretability should relate to how humans understand policy actions given states. To advance research in interpretable reinforcement learning, we contribute a new methodology to evaluate policy interpretability. This new methodology relies on proxies for simulatability that we use to conduct a large-scale empirical evaluation of policy interpretability. We use imitation learning to compute baseline policies by distilling expert neural networks into small programs. We then show that using our methodology to evaluate the interpretability of the baselines leads to conclusions similar to those of user studies. We show that increasing interpretability does not necessarily reduce performance and can sometimes increase it. We also show that no policy class trades off interpretability and performance better than the others across all tasks, making it necessary for researchers to have methodologies for comparing the interpretability of policies.
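The distillation step can be pictured with a minimal DAgger-style loop. The sketch below is a toy under stated assumptions, not the paper's algorithm: the 1-D environment, the hand-written `expert_policy` standing in for a neural expert, and the one-split decision stump used as the interpretable student are all hypothetical.

```python
import random


def expert_policy(x):
    """Stand-in for a neural expert: push the state toward zero."""
    return 1 if x < 0.0 else 0


def step(x, action):
    """Toy 1-D dynamics: action 1 moves right, action 0 moves left."""
    return x + 0.1 if action == 1 else x - 0.1


def fit_stump(dataset):
    """Fit 'left if x <= t else right' by minimising disagreement with the expert labels."""
    best = None
    for t in sorted({x for x, _ in dataset}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != a for x, a in dataset)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def dagger(n_iters=3, n_rollouts=5, horizon=20, seed=0):
    rng = random.Random(seed)
    dataset, student = [], None
    for _ in range(n_iters):
        # The first iteration rolls out the expert (behavior cloning data);
        # later iterations roll out the current student.
        actor = expert_policy if student is None else student
        for _ in range(n_rollouts):
            x = rng.uniform(-1.0, 1.0)
            for _ in range(horizon):
                # The expert labels every state the actor visits.
                dataset.append((x, expert_policy(x)))
                x = step(x, actor(x))
        # Refit the student on the aggregated dataset.
        student = fit_stump(dataset)
    return student


student = dagger()
```

Stopping after the first iteration corresponds to plain behavior cloning. The later iterations add states that the student itself visits, which is the point of dataset aggregation.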

Paper Structure

This paper contains 25 sections, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Generic linear policy with hidden operations interacting with an environment.
  • Figure 2: Unfolded linear policy interacting with an environment.
  • Figure 3: Performance of the imitation learning variants of Algorithm 1 on different environments. We plot 95% stratified bootstrapped confidence intervals around the interquartile means (IQMs).
  • Figure 4: Performance profiles of different policy classes on different environments.
  • Figure 5: Simulatability proxies on classic control environments. We plot 95% stratified bootstrapped confidence intervals around means on both axes. In each subplot, interpretability is measured with the proxy named in the subplot title.
  • ...and 5 more figures
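
Several captions above mention stratified bootstrapped confidence intervals around IQMs. The sketch below shows one common way to compute them. The exact procedure used in the paper is not given here, so the stratification by task, the nearest-rank percentile rule, and the example scores are assumptions.

```python
import random


def iqm(scores):
    """Interquartile mean: drop the lowest and highest 25% of scores, average the rest."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def stratified_bootstrap_ci(scores_by_task, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the IQM, resampling runs within each task."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        pooled = []
        for runs in scores_by_task.values():
            # Stratification: resample with replacement inside each task separately.
            pooled.extend(rng.choices(runs, k=len(runs)))
        replicates.append(iqm(pooled))
    replicates.sort()
    alpha = (1.0 - level) / 2.0
    low = replicates[int(alpha * (n_boot - 1))]
    high = replicates[int((1.0 - alpha) * (n_boot - 1))]
    return low, high


# Hypothetical normalised scores: four runs on each of two tasks.
SCORES = {"task_a": [0.1, 0.4, 0.5, 0.9], "task_b": [0.2, 0.3, 0.7, 0.8]}
LOW, HIGH = stratified_bootstrap_ci(SCORES)
```

Resampling within each task keeps the number of runs per task fixed in every bootstrap replicate, so no task can vanish from a resample.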