xRFM: Accurate, scalable, and interpretable feature learning models for tabular data

Daniel Beaglehole, David Holzmüller, Adityanarayanan Radhakrishnan, Mikhail Belkin

TL;DR

xRFM tackles tabular data prediction by combining feature-learning kernel machines with a tree-based partitioning scheme that captures local data structure. The approach enables local feature learning in the leaves while maintaining near-linear training time and logarithmic inference time, and it provides native interpretability through the Average Gradient Outer Product. Empirically, xRFM achieves state-of-the-art performance on 100 tabular regression datasets and remains competitive on 200 classification datasets, where it outperforms GBDTs. This scalable, interpretable framework is well suited to uncovering heterogeneity and structure in large-scale tabular data.

Abstract

Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation for modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, the best practice for these predictive tasks has remained relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning methods. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that compared to $31$ other methods, including recently introduced tabular foundation models (TabPFNv2) and GBDTs, xRFM achieves the best performance across $100$ regression datasets and is competitive with the best methods across $200$ classification datasets, outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.
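The Average Gradient Outer Product (AGOP) named in the abstract is, in its standard form for a scalar predictor $f$ and inputs $x_1, \dots, x_n$, the matrix $\frac{1}{n}\sum_{i} \nabla f(x_i) \nabla f(x_i)^\top$. The following is a minimal sketch of that formula only, not of xRFM itself: the helper `agop`, the toy predictor, and the random data are all illustrative assumptions.

```python
import numpy as np

def agop(grad_fn, X):
    """Average gradient outer product of a scalar predictor over the rows of X."""
    G = np.stack([grad_fn(x) for x in X])  # (n, d) matrix, one gradient per row
    return G.T @ G / len(X)                # (d, d), symmetric positive semidefinite

# Toy predictor (an assumption for illustration): f(x) = sin(w . x),
# which depends on x only through the first coordinate.
w = np.array([2.0, 0.0, 0.0])
grad_f = lambda x: np.cos(w @ x) * w       # analytic gradient of f

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
M = agop(grad_f, X)

# eigh returns eigenvalues in ascending order, so the last column is the
# eigenvector of the largest eigenvalue (the "top AGOP direction").
top_direction = np.linalg.eigh(M)[1][:, -1]
```

In this toy case only the first diagonal entry of `M` is nonzero, so the matrix can be read as a feature-importance summary: coordinates the predictor ignores contribute nothing.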

Paper Structure

This paper contains 23 sections, 6 equations, 8 figures, 11 tables, 5 algorithms.

Figures (8)

  • Figure 1: Overview of xRFM training and inference procedures. (A) xRFM is trained by splitting the data along the median projections (denoted $c_1, c_2$) onto computed split directions (denoted $v_1, v_2$). Data is split repeatedly into leaves, which contain at most $C$ training samples. Leaf RFMs are trained on the data at each leaf. (B) During inference, test data is routed to the appropriate leaf RFM based on split directions. The prediction is generated by the selected leaf RFM.
  • Figure 2: Training xRFM on synthetic data where splitting on the top AGOP direction enables xRFM to learn locally relevant features.
  • Figure 3: Performance and runtime of xRFM on the TALENT (Plots A-C) and TabArena-Lite benchmarks (Plots D-F). The y-axes of plots A-C are the shifted geometric mean of the error across all datasets in that category, while the x-axes are the average over all datasets of the training plus inference time per 1000 samples for just the best hyperparameter configuration (meaning if a dataset has $n$ samples, we compute the training and inference time on the $n$ samples and divide the total time by $n / 1000$). The y-axes in plots D-F are Elo, the main metric used in TabArena, and reflect the relative win-rate of each method, while the x-axes are the median inference time per 1000 total samples. The TabArena-Lite plots additionally show the default methods and the compute Pareto tradeoff curve for inference time versus Elo. The TALENT plots are (A) nRMSE over 100 regression datasets, (B) classification error over 80 multi-class datasets, (C) classification error over 120 binary classification datasets. The TabArena-Lite plots are (D) regression, (E) multi-class, and (F) binary datasets.
  • Figure 4: Comparisons of xRFM with (A) kernel ridge regression and (B) the original RFM [radhakrishnan2025] on TALENT regression datasets (metric is normalized $R^2$; see the methods appendix). Each point is a dataset.
  • Figure 5: Total training and inference time for the best hyperparameter configuration as a function of the number of samples (training+validation+testing) across the TALENT benchmark. Curves indicate piecewise linear fits to the measurements on each dataset (shown as points). (A) Results across 100 regression tasks. (B) Results across 80 multi-class classification tasks. (C) Results across 120 binary classification tasks.
  • ...and 3 more figures
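
The partitioning described in the Figure 1 caption (split at the median projection onto a direction, recurse until a leaf holds at most $C$ samples, route test points by the same splits) can be sketched roughly as below. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the split direction is a stand-in (top principal component rather than an AGOP-derived direction), and each leaf stores the mean target instead of a trained leaf RFM.

```python
import numpy as np

def build_tree(X, y, max_leaf_size, direction_fn):
    """Split at the median projection until each leaf holds <= max_leaf_size rows."""
    if len(X) <= max_leaf_size:
        # Placeholder leaf model (assumption): the paper trains a leaf RFM here.
        return {"leaf": True, "value": float(np.mean(y))}
    v = direction_fn(X, y)              # split direction
    proj = X @ v
    c = float(np.median(proj))          # split threshold: median projection
    left = proj <= c
    if left.all():                      # degenerate split, e.g. duplicated rows
        return {"leaf": True, "value": float(np.mean(y))}
    return {"leaf": False, "v": v, "c": c,
            "left": build_tree(X[left], y[left], max_leaf_size, direction_fn),
            "right": build_tree(X[~left], y[~left], max_leaf_size, direction_fn)}

def predict_one(tree, x):
    """Route x down the tree by its projections and return the leaf's prediction."""
    while not tree["leaf"]:
        tree = tree["left"] if x @ tree["v"] <= tree["c"] else tree["right"]
    return tree["value"]

def top_pc(X, y):
    """Stand-in direction (assumption): first principal component of X."""
    Xc = X - X.mean(axis=0)
    return np.linalg.svd(Xc, full_matrices=False)[2][0]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(float)
tree = build_tree(X, y, max_leaf_size=128, direction_fn=top_pc)
prediction = predict_one(tree, np.array([0.5, -0.2]))
```

Because every split sends about half of the samples to each side, the tree depth grows logarithmically with the number of training samples, so routing a test point touches only a logarithmic number of nodes.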

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3