Autoencoding Random Forests

Binh Duc Vu, Jan Kapar, Marvin Wright, David S. Watson

TL;DR

The paper addresses learning latent embeddings from random forests and decoding back to input space by viewing RFs as adaptive kernels. It develops a diffusion-map-based encoding that yields a low-dimensional embedding and introduces three decoding strategies—constrained optimization, split relabeling, and $k$-NN decoding—each with universal consistency guarantees. The work provides theoretical results establishing the RF kernel as PSD, doubly stochastic, universal, and characteristic, and demonstrates the RF autoencoder (RFAE) in visualization, compression, clustering, and denoising across tabular, image, and genomic data. Practically, this yields a versatile, interpretable alternative to deep autoencoders for mixed data types, with competitive performance and scalable decoding options.

Abstract

We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble's constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data.

Paper Structure

This paper contains 31 sections, 6 theorems, 18 equations, 9 figures, 4 tables, 1 algorithm.

Key Result

Theorem 3.4

Assume standard RF regularity conditions (see the proofs appendix). Then: (a) For all $n \in \mathbb N$, the function $k_n^{RF}$ is PSD and the kernel matrix $\mathbf K \in [0,1]^{n \times n}$ is doubly stochastic. (b) Let $\{f_n\}$ be a sequence of RFs. Then the associated RKHS sequence $\{\mathcal{H}_n\}$ ... (c) The RKHS sequence $\{\mathcal{H}_n\}$ is asymptotically characteristic. That is, for any $\epsilon$ ...
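Part (a) can be checked numerically on a toy example. The sketch below is illustrative only: this page does not give the definition of $k_n^{RF}$, so the code assumes a leaf-size-normalized co-occurrence kernel (two points score $1/m$ in a tree when they share a leaf holding $m$ training points, averaged over trees). The function name `rf_kernel` and the synthetic leaf indices are hypothetical; with a fitted scikit-learn forest, an equivalent `(n, B)` leaf-index matrix could be obtained from its `apply` method.

```python
import numpy as np


def rf_kernel(leaves: np.ndarray) -> np.ndarray:
    """Assumed RF kernel: leaf-size-normalized co-occurrence, averaged over trees.

    leaves: (n, B) integer array, leaves[i, b] = leaf index of point i in tree b.
    """
    n, n_trees = leaves.shape
    K = np.zeros((n, n))
    for b in range(n_trees):
        col = leaves[:, b]
        same_leaf = col[:, None] == col[None, :]           # (n, n) boolean mask
        leaf_size = same_leaf.sum(axis=1, keepdims=True)   # size of each point's leaf
        K += same_leaf / leaf_size                         # entry (i, j) = 1[same leaf] / m
    return K / n_trees


# Synthetic stand-in for a fitted forest: 50 points, 20 trees, 4 leaves per tree.
rng = np.random.default_rng(0)
leaves = rng.integers(0, 4, size=(50, 20))
K = rf_kernel(leaves)

symmetric = np.allclose(K, K.T)
rows_sum_to_one = np.allclose(K.sum(axis=1), 1.0)
cols_sum_to_one = np.allclose(K.sum(axis=0), 1.0)
min_eigenvalue = np.linalg.eigvalsh(K).min()   # should be >= 0 up to round-off
print(symmetric, rows_sum_to_one, cols_sum_to_one, min_eigenvalue >= -1e-8)
```

Under this assumed definition, each tree contributes a block-diagonal matrix whose blocks are $\frac{1}{m}\mathbf{1}\mathbf{1}^\top$, which is why the average is symmetric, PSD, and has unit row and column sums.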

Figures (9)

  • Figure 1: Visual summary of the encoding pipeline. (a) Input data can be a mix of continuous, ordinal, and/or categorical variables. (b) A RF (supervised or unsupervised) is trained on the data. (c) A kernel matrix $\mathbf K \in [0,1]^{n \times n}$ is extracted from the ensemble. (d) $\mathbf K$ is decomposed into its eigenvectors and eigenvalues, as originally proposed by David Hilbert (pictured). (e) Data is projected onto the top $d_{\mathcal{Z}} < n$ principal components of the diffusion map, resulting in a new embedding $\mathbf Z \in \mathbb R^{n \times d_{\mathcal{Z}}}$.
  • Figure 2: Diffusion maps visualize RF training. Using a subsample of the MNIST dataset, we find that digits become more distinct in the embedding space as tree depth increases.
  • Figure 3: MNIST digit reconstructions with varying latent dimension sizes; original images are displayed in the bottom row.
  • Figure 4: Denoising with RFAE alleviates batch effects in scRNA-seq data.
  • Figure 5: Compression-distortion trade-off on twenty benchmark tabular datasets. Shading represents standard errors across ten bootstraps.
  • ...and 4 more figures
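
Steps (c) to (e) of Figure 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel is the leaf-size-normalized co-occurrence matrix (an assumption, since the definition is not reproduced on this page), the trivial constant eigenvector is dropped as is conventional for diffusion maps, and the diffusion time `t` and the helper names `leaf_kernel` and `diffusion_embed` are hypothetical.

```python
import numpy as np


def leaf_kernel(leaves: np.ndarray) -> np.ndarray:
    """Assumed kernel: per tree, H diag(1 / leaf sizes) H^T, averaged over trees."""
    n, n_trees = leaves.shape
    K = np.zeros((n, n))
    for b in range(n_trees):
        ids = np.unique(leaves[:, b])
        H = (leaves[:, [b]] == ids[None, :]).astype(float)  # (n, L) one-hot leaf membership
        K += (H / H.sum(axis=0)) @ H.T                       # normalize each leaf by its size
    return K / n_trees


def diffusion_embed(K: np.ndarray, d: int, t: int = 1):
    """Spectral embedding: eigenvectors of K scaled by eigenvalues raised to power t."""
    lam, V = np.linalg.eigh(K)              # eigh returns eigenvalues in ascending order
    order = np.argsort(lam)[::-1]           # reorder to descending
    lam, V = lam[order], V[:, order]
    # Skip index 0: for a doubly stochastic K it is the constant vector (eigenvalue 1).
    Z = V[:, 1:d + 1] * lam[1:d + 1] ** t
    return Z, lam


# Synthetic leaf assignments standing in for a trained forest (40 points, 30 trees).
rng = np.random.default_rng(1)
leaves = rng.integers(0, 4, size=(40, 30))
K = leaf_kernel(leaves)
Z, lam = diffusion_embed(K, d=2, t=1)
print(Z.shape, lam[:3])
```

Some diffusion-map conventions additionally rescale the eigenvectors by $\sqrt{n}$; that constant factor is omitted here because it does not change the geometry of the embedding.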

Theorems & Definitions (11)

  • Definition 3.1: Positive semidefinite
  • Definition 3.2: Universal
  • Definition 3.3: Characteristic
  • Theorem 3.4: RF kernel properties
  • Definition 4.1: Universally consistent decoder
  • Theorem 4.2: Oracle consistency
  • Theorem 4.3: Uniqueness
  • Theorem 4.4: $k$-NN consistency
  • Lemma A.1: RF subalgebra
  • proof
  • ...and 1 more
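
The nearest-neighbors decoder named in the abstract and in Theorem 4.4 can be illustrated with plain $k$-NN regression from the embedding space back to the input space. The sketch below is a simplified stand-in, assuming continuous features only and unweighted averaging; the paper's actual $k$-NN decoder may differ (for example, in how it handles categorical variables or weights neighbors). The function name `knn_decode` and the random data are hypothetical.

```python
import numpy as np


def knn_decode(z_query: np.ndarray, Z_train: np.ndarray, X_train: np.ndarray, k: int = 5) -> np.ndarray:
    """Decode latent points by averaging the inputs of their k nearest training embeddings."""
    z_query = np.atleast_2d(z_query)
    # Squared Euclidean distances between queries and training embeddings: shape (m, n).
    d2 = ((z_query[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]      # indices of the k closest training points
    return X_train[nn].mean(axis=1)         # (m, p) reconstructed inputs


# Hypothetical data: 100 training points, 3 latent dimensions, 5 input features.
rng = np.random.default_rng(2)
Z_train = rng.normal(size=(100, 3))
X_train = rng.normal(size=(100, 5))

x_hat = knn_decode(Z_train[:4], Z_train, X_train, k=5)   # decode four latent points
x_self = knn_decode(Z_train, Z_train, X_train, k=1)      # k=1 on training embeddings
x_all = knn_decode(Z_train[0], Z_train, X_train, k=100)  # k=n gives the column means
print(x_hat.shape)
```

With `k=1`, decoding a training embedding returns its own input exactly; larger `k` trades fidelity for smoothing, which is the usual bias-variance knob in nearest-neighbors regression.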