Distilling interpretable causal trees from causal forests

Patrick Rehill

TL;DR

This work tackles the challenge of extracting interpretable insights from the high-dimensional distributions of conditional average treatment effects (CATEs) produced by causal forests. It introduces the Distilled Causal Tree (DCT), which uses knowledge distillation (KD) to learn a single, interpretable tree from a powerful teacher (the causal forest), yielding leaves with doubly robust, asymptotically normal estimates. By pairing KD with an optimal (or evolutionary) tree-fitting approach, the method often outperforms other single-tree extractions and, in noisy, high-dimensional settings, can surpass the teacher itself. The approach is demonstrated through simulations on ACIC 2016 data and a real-world canvassing experiment, highlighting its potential to provide actionable, interpretable causal insights for policy and targeting decisions.

Abstract

Machine learning methods for estimating treatment effect heterogeneity promise greater flexibility than existing methods that test a few pre-specified hypotheses. However, it can be challenging to extract insights from complicated machine learning models. A high-dimensional distribution of conditional average treatment effects may give accurate, individual-level estimates, but it can be hard to understand the underlying patterns and hard to know what the implications of the analysis are. This paper proposes the Distilled Causal Tree, a method for distilling a single, interpretable causal tree from a causal forest. The method compares well to existing methods of extracting a single tree, particularly in noisy data or in high-dimensional data with many correlated features. In these settings it even outperforms the base causal forest in most simulations. Its estimates are doubly robust and asymptotically normal, just as those of the causal forest are.
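As a rough illustration of the distillation idea described above, the following minimal sketch fits a student tree to a teacher's predicted treatment effects rather than to raw outcomes. Everything in it is an assumption made for illustration: the teacher predictions are simulated noise around a known effect (standing in for a causal forest), the student is a depth-1 stump found by exhaustive search (standing in for the optimal or evolutionary tree fitting), and the leaf values are plain means of teacher predictions, not the doubly robust leaf estimates the paper describes.

```python
import random


def fit_stump(xs, targets):
    """Exhaustively search one feature for the split that minimizes squared error.

    Returns (threshold, left_leaf_mean, right_leaf_mean), where the left leaf
    holds units with x < threshold and the right leaf holds the rest.
    """
    best = None
    # Skipping the smallest value guarantees both leaves are non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


rng = random.Random(0)

# Simulated covariate and a true effect that jumps from 0 to 2 at x = 0.5.
xs = [rng.uniform(0, 1) for _ in range(200)]
true_cate = [2.0 if x >= 0.5 else 0.0 for x in xs]

# Hypothetical teacher: noisy per-unit CATE predictions standing in for
# the output of a causal forest.
teacher_cate = [t + rng.gauss(0, 0.3) for t in true_cate]

# Distillation step: the student is trained on the teacher's predictions.
threshold, left_leaf, right_leaf = fit_stump(xs, teacher_cate)
print(f"split at x < {threshold:.3f}: leaf effects {left_leaf:.2f} and {right_leaf:.2f}")
```

The point of the sketch is only the training target: the student never sees outcomes or treatment assignments, only the teacher's smoothed effect estimates, which is what lets a single tree recover structure that individual trees in the ensemble fit noisily.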

Paper Structure (17 sections, 5 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Ground truth mean absolute error results on original ACIC data. The grey histogram shows the predictions of the (pruned) individual trees in the ensemble, while the lines mark the specific models being compared.
  • Figure 2: Ground truth mean absolute error results with noise and correlated features introduced into $X$. The grey histogram shows the predictions of the (pruned) individual trees in the ensemble, while the lines mark the specific models being compared.
  • Figure 3: Simulation results on original ACIC data with R-Loss. The grey histogram shows the predictions of the (pruned) individual trees in the ensemble, while the lines mark the specific models being compared.
  • Figure 4: Simulation results with noise and correlated features introduced into $X$ with R-Loss. The grey histogram shows the predictions of the (pruned) individual trees in the ensemble, while the lines mark the specific models being compared.
  • Figure 5: The DCT for the effect of a cash transfer on maths scores.
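
The captions above refer to two evaluation measures: mean absolute error against the simulated ground truth, and R-Loss. The sketch below shows how each could be computed. It assumes that R-Loss here takes the usual R-learner form, the mean of ((Y - m(X)) - (W - e(X)) τ(X))², where m is an estimate of the conditional mean outcome and e an estimate of the propensity score. The function and argument names are hypothetical, and the exact definition used in the paper may differ.

```python
def mean_absolute_error(tau_hat, tau_true):
    """Average absolute gap between predicted and true treatment effects."""
    return sum(abs(h - t) for h, t in zip(tau_hat, tau_true)) / len(tau_true)


def r_loss(y, w, m_hat, e_hat, tau_hat):
    """Assumed R-learner loss: mean of ((y - m) - (w - e) * tau)^2.

    y: outcomes, w: treatment indicators, m_hat: estimated E[Y | X],
    e_hat: estimated propensity scores, tau_hat: predicted effects.
    """
    terms = [
        ((yi - mi) - (wi - ei) * ti) ** 2
        for yi, wi, mi, ei, ti in zip(y, w, m_hat, e_hat, tau_hat)
    ]
    return sum(terms) / len(terms)


# Toy data: one treated unit with outcome 1, one control unit with outcome 0.
y, w = [1.0, 0.0], [1, 0]
m_hat, e_hat = [0.5, 0.5], [0.5, 0.5]

mae = mean_absolute_error([1.0, 3.0], [0.0, 1.0])
loss_effect_one = r_loss(y, w, m_hat, e_hat, [1.0, 1.0])
loss_effect_zero = r_loss(y, w, m_hat, e_hat, [0.0, 0.0])
```

Unlike the mean absolute error, the R-Loss needs no ground-truth effects, only outcomes, treatments, and nuisance estimates, which is why it can also be evaluated on real data where the true effects are unknown.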