Learning Interpretable Characteristic Kernels via Decision Forests

Sambit Panda, Cencheng Shen, Joshua T. Vogelstein

TL;DR

We prove that the proximity induced by a decision forest can be made characteristic, which yields a universally consistent statistic for testing independence, and we show how this learned kernel offers insight into relative feature importance.

Abstract

Decision forests are widely used for classification and regression tasks. A lesser-known property of tree-based methods is that one can construct a proximity matrix from the tree(s), and these proximity matrices are induced kernels. While there has been extensive research on the applications and properties of kernels, there is relatively little research on kernels induced by decision forests. We construct Kernel Mean Embedding Random Forests (KMERF), which induce kernels from random trees and/or forests using leaf-node proximity. We introduce the notion of an asymptotically characteristic kernel, and prove that KMERF kernels are asymptotically characteristic for both discrete and continuous data. Because KMERF is data-adaptive, we suspected it would outperform kernels selected a priori on finite-sample data. We illustrate that KMERF nearly dominates current state-of-the-art kernel-based tests across a diverse range of high-dimensional two-sample and independence testing settings. Furthermore, our forest-based approach is interpretable, and provides feature importance metrics that readily distinguish important dimensions, unlike other high-dimensional non-parametric testing procedures. Hence, this work demonstrates that the decision forest-based kernel can be more powerful and more interpretable than existing methods, flying in the face of the conventional wisdom that there is a trade-off between the two.
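To make "leaf-node proximity" concrete, the following is a minimal sketch, assuming the common definition in which the proximity of two samples is the fraction of trees in which they land in the same leaf. Whether KMERF uses exactly this normalization is an assumption here, and the function name `proximity_kernel` and the leaf assignments are hypothetical. The input has the shape of what a fitted scikit-learn forest returns from its `apply` method: one row per sample, one leaf index per tree.

```python
def proximity_kernel(leaves):
    """Build a proximity matrix from leaf assignments.

    leaves[i][t] is the index of the leaf that sample i reaches in tree t.
    Entry (i, j) of the result is the fraction of trees in which samples
    i and j share a leaf (assumed definition, see the lead-in).
    """
    n_samples = len(leaves)
    n_trees = len(leaves[0])
    kernel = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(i, n_samples):
            shared = sum(
                1 for t in range(n_trees) if leaves[i][t] == leaves[j][t]
            )
            # The matrix is symmetric, so fill both triangles at once.
            kernel[i][j] = kernel[j][i] = shared / n_trees
    return kernel


# Hypothetical leaf assignments for 4 samples in a 3-tree forest.
leaves = [
    [0, 2, 1],
    [0, 2, 3],
    [1, 2, 3],
    [1, 0, 0],
]

K = proximity_kernel(leaves)
for row in K:
    print([round(value, 2) for value in row])
```

Every sample shares all of its own leaves, so the diagonal is 1, and samples that never share a leaf get proximity 0. Because the leaves come from a forest fitted to the data, the resulting kernel adapts to the data rather than being fixed in advance.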

Paper Structure

This paper contains 19 sections, 6 theorems, 9 equations, 6 figures.

Key Result

Theorem 1

The random forest induced kernel $\mathbf{K}^{\mathbf{x}}$ is always positive definite.
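One standard way to see why a statement of this kind holds is sketched below; it is not necessarily the paper's own proof. It assumes the kernel entry is the fraction of $T$ trees in which two samples share a leaf, writes $L_t(x)$ for the leaf of tree $t$ that contains $x$ (notation introduced here for illustration), and reads "positive definite" in the usual kernel sense, meaning every quadratic form in the Gram matrix is nonnegative.

```latex
\mathbf{K}^{\mathbf{x}}_{ij}
  = \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}\!\left[L_t(x_i) = L_t(x_j)\right],
\qquad
c^{\top}\mathbf{K}^{\mathbf{x}}c
  = \frac{1}{T}\sum_{t=1}^{T}\sum_{\ell}
    \Bigl(\sum_{i\,:\,L_t(x_i)=\ell} c_i\Bigr)^{2} \;\ge\; 0
\quad \text{for all } c \in \mathbb{R}^{n}.
```

Each tree contributes a block structure in which samples in the same leaf form a block of ones, so the quadratic form collapses to a sum of squared within-leaf totals, and averaging over trees preserves nonnegativity.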

Figures (6)

  • Figure 1: Multivariate independence testing power for $20$ different settings with increasing $p$, fixed $q=1$, and $n=100$. For the majority of the simulations and simulation dimensions, KMERF performs as well as, or better than, existing multivariate independence tests in high-dimensional dependence testing.
  • Figure 2: Multivariate two-sample testing power for $20$ different settings with increasing $p$, fixed $q=1$, and $n=100$. For nearly all simulations and simulation dimensions, KMERF performs as well as, or better than, existing multivariate two-sample tests in high-dimensional dependence testing.
  • Figure 3: Normalized mean (black) and 95% confidence intervals (light grey) using min-max normalization for relative feature importances derived from random forest over five dimensions for each simulation tested for 100 samples. The features were sorted from most to least informative for all simulations except the Independence simulation. As expected, estimated feature importance decreases as dimension increases. KMERF also offers interpretability: we show here which dimensions of our simulations influence the outcome of the independence test the most.
  • Figure 4: (A) For each peptide, the KMERF p-value for testing dependence between pancreatic and healthy subjects is compared to the p-value for testing dependence between pancreatic and all other subjects. At the critical level 0.05, KMERF identifies a unique protein. (B) The true and false positive counts using a k-nearest neighbor (choosing the best $k \in [1,10]$) leave-one-out classification using only the significant peptides identified by each method. The peptide identified by KMERF achieves the best true and false positive rates.
  • Figure F1: Simulations used for Figures 1 and 3. 100 points from noisy simulations (light grey points) on 1000 points from simulations without noise (dark grey points) for each of the 20 dimensional simulations shown above.
  • ...and 1 more figures

Theorems & Definitions (10)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Corollary 3
  • Theorem 1
  • Proof 1
  • Theorem 2
  • Proof 2
  • Corollary 3
  • Proof 3