Autoencoding Random Forests
Binh Duc Vu, Jan Kapar, Marvin Wright, David S. Watson
TL;DR
The paper treats random forests (RFs) as adaptive kernels in order to learn latent embeddings and to decode them back to the input space. Encoding uses diffusion maps to obtain a low-dimensional embedding; decoding uses one of three strategies (constrained optimization, split relabeling, or $k$-NN decoding), each with universal consistency guarantees. The theoretical results establish that the RF kernel is positive semidefinite (PSD), doubly stochastic, universal, and characteristic. The RF autoencoder (RFAE) is demonstrated on visualization, compression, clustering, and denoising tasks across tabular, image, and genomic data. In practice, it offers a versatile, interpretable alternative to deep autoencoders for mixed data types, with competitive performance and scalable decoding options.
Abstract
We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble's constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data.
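The encode-then-decode pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes scikit-learn and NumPy, uses a leaf co-membership kernel normalized by leaf size, embeds the training points with the leading nontrivial eigenvectors of that kernel, and decodes with a plain unweighted $k$-NN average in place of the paper's decoders. The helper names (`rf_kernel`, `diffusion_embed`, `knn_decode`), the toy data, the $\sqrt{n}$ scaling, and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors


def rf_kernel(forest, X):
    """Leaf co-membership kernel on X, averaged over trees.

    For each tree, entry (i, j) is 1/|leaf| if points i and j fall in the
    same leaf and 0 otherwise. Leaf sizes are counted on X itself, which is
    a simplification of this sketch.
    """
    leaves = forest.apply(X)  # shape (n_samples, n_trees), leaf index per tree
    n, n_trees = leaves.shape
    K = np.zeros((n, n))
    for b in range(n_trees):
        same_leaf = leaves[:, b][:, None] == leaves[:, b][None, :]
        K += same_leaf / same_leaf.sum(axis=1, keepdims=True)
    return K / n_trees


def diffusion_embed(K, dim, t=1):
    """Diffusion-map style embedding from a symmetric stochastic kernel."""
    eigvals, eigvecs = np.linalg.eigh(K)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Skip the leading eigenpair (eigenvalue 1, constant vector).
    lam = eigvals[1:dim + 1]
    V = eigvecs[:, 1:dim + 1]
    # The sqrt(n) scaling and the diffusion time t are assumptions here.
    return np.sqrt(K.shape[0]) * V * lam**t


def knn_decode(Z_train, X_train, Z_query, k=5):
    """Decode by averaging the inputs of the k nearest training embeddings."""
    nn = NearestNeighbors(n_neighbors=k).fit(Z_train)
    _, idx = nn.kneighbors(Z_query)
    return X_train[idx].mean(axis=1)  # (n_query, k, d) -> (n_query, d)


# Toy supervised example on synthetic continuous features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + X[:, 1] ** 2

forest = RandomForestRegressor(
    n_estimators=50, min_samples_leaf=5, random_state=0
).fit(X, y)

K = rf_kernel(forest, X)           # n x n kernel matrix
Z = diffusion_embed(K, dim=2)      # n x 2 embedding
X_hat = knn_decode(Z, X, Z, k=5)   # reconstruction in input space
```

By construction, each per-tree matrix in this sketch is symmetric with rows summing to one, so the averaged `K` is symmetric, doubly stochastic, and positive semidefinite on the training set, in line with the kernel properties the paper states. The averaging decoder only makes sense for continuous features; categorical features would need a vote or another rule.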
