RFFNet: Large-Scale Interpretable Kernel Methods via Random Fourier Features

Mateus P. Otto; Rafael Izbicki

TL;DR

RFFNet tackles the dual challenge of scaling kernel methods and maintaining interpretability by marrying random Fourier features with product ARD kernels. It jointly learns per-feature relevances $\boldsymbol{\theta}$ and predictive parameters $\boldsymbol{\beta}$ via stochastic optimization, using fixed random features to keep memory and computation low. The method yields interpretable feature importances while delivering competitive predictive accuracy across simulations and real-world data, and includes a thresholding procedure (TopK) for reliable variable selection. A PyTorch-based implementation alongside ablation studies supports practical adoption and reproducibility in large-scale settings.
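The thresholding step mentioned above can be illustrated with a minimal sketch. It assumes that TopK keeps the `k` features with the largest absolute learned relevance and zeroes the rest. The function names `top_k_features` and `hard_threshold` are hypothetical and are not taken from the authors' library.

```python
def top_k_features(relevances, k):
    """Return the sorted indices of the k features with largest |relevance|.

    Assumption: ranking by absolute value, since only the magnitude of a
    relevance rescales the corresponding input coordinate.
    """
    if not 0 <= k <= len(relevances):
        raise ValueError("k must be between 0 and the number of features")
    ranked = sorted(range(len(relevances)),
                    key=lambda i: abs(relevances[i]),
                    reverse=True)
    return sorted(ranked[:k])


def hard_threshold(relevances, k):
    """Zero out every relevance that is not among the top k."""
    keep = set(top_k_features(relevances, k))
    return [r if i in keep else 0.0 for i, r in enumerate(relevances)]


# Hypothetical learned relevances for five features.
theta_hat = [0.9, -0.05, 0.02, -1.3, 0.1]
selected = top_k_features(theta_hat, k=2)
theta_sparse = hard_threshold(theta_hat, k=2)
```

Choosing `k` is left open here; in practice it would be picked by validation or prior knowledge of how many features are expected to be active.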

Abstract

Kernel methods provide a flexible and theoretically grounded approach to nonlinear and nonparametric learning. Although memory and run-time requirements hinder their applicability to large datasets, many low-rank kernel approximations, such as random Fourier features, have recently been developed to scale them up. However, these scalable approaches are based on approximations of isotropic kernels, which cannot remove the influence of irrelevant features. In this work, we design random Fourier features for a family of automatic relevance determination (ARD) kernels, and introduce RFFNet, a new large-scale kernel method that learns the kernel relevances on the fly via first-order stochastic optimization. We present an effective initialization scheme for the method's non-convex objective function, evaluate whether hard-thresholding RFFNet's learned relevances yields a sensible rule for variable selection, and perform an extensive ablation study of RFFNet's components. Numerical validation on simulated and real-world data shows that our approach has a small memory footprint and run-time, achieves low prediction error, and effectively identifies relevant features, thus leading to more interpretable solutions. We supply users with an efficient, PyTorch-based library that adheres to the scikit-learn standard API, along with code for fully reproducing our results.
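To make "learning the relevances on the fly" concrete, here is a dependency-free sketch under stated assumptions: a Gaussian kernel, the model $f(\bm{x}) = \boldsymbol{\beta}^\top \bm{z}(\boldsymbol{\theta} \circ \bm{x})$ with frequencies and phases drawn once and kept fixed, an unregularized squared-error loss, and plain SGD. The toy data, step sizes, and function names are illustrative choices, not the authors' objective, optimizer, or initialization scheme.

```python
import math
import random


def features(x, theta, omegas, phases):
    """Random Fourier features evaluated at the rescaled input theta * x."""
    scale = math.sqrt(2.0 / len(omegas))
    args = [sum(w[k] * theta[k] * x[k] for k in range(len(x))) + b
            for w, b in zip(omegas, phases)]
    return [scale * math.cos(a) for a in args], args


def predict(x, theta, beta, omegas, phases):
    z, _ = features(x, theta, omegas, phases)
    return sum(bj * zj for bj, zj in zip(beta, z))


def gradients(x, y, theta, beta, omegas, phases):
    """Gradients of 0.5 * (f(x) - y)^2 with respect to beta and theta."""
    z, args = features(x, theta, omegas, phases)
    scale = math.sqrt(2.0 / len(omegas))
    resid = sum(bj * zj for bj, zj in zip(beta, z)) - y
    grad_beta = [resid * zj for zj in z]
    grad_theta = [
        resid * sum(-bj * scale * math.sin(a) * w[k] * x[k]
                    for bj, a, w in zip(beta, args, omegas))
        for k in range(len(x))
    ]
    return grad_beta, grad_theta


def mse(data, theta, beta, omegas, phases):
    return sum((predict(x, theta, beta, omegas, phases) - y) ** 2
               for x, y in data) / len(data)


def fit(data, d, s=40, epochs=10, lr_beta=0.1, lr_theta=0.01, seed=0):
    """Jointly update beta and theta by SGD; omegas and phases stay fixed."""
    rng = random.Random(seed)
    omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(s)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(s)]
    theta = [1.0] * d   # start from the isotropic kernel (illustrative choice)
    beta = [0.0] * s
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x, y in order:
            gb, gt = gradients(x, y, theta, beta, omegas, phases)
            beta = [b - lr_beta * g for b, g in zip(beta, gb)]
            theta = [t - lr_theta * g for t, g in zip(theta, gt)]
    return theta, beta, omegas, phases


# Toy regression problem (hypothetical): only the first of three features matters.
data_rng = random.Random(1)
data = []
for _ in range(150):
    x = [data_rng.uniform(-1.0, 1.0) for _ in range(3)]
    data.append((x, math.sin(2.0 * x[0])))

initial_mse = sum(y * y for _, y in data) / len(data)  # beta = 0 predicts 0
theta, beta, omegas, phases = fit(data, d=3)
final_mse = mse(data, theta, beta, omegas, phases)
```

Because the random frequencies are sampled once, the only trainable quantities are the `d` relevances and the `s` coefficients, which is what keeps the memory footprint small.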

Paper Structure (31 sections, 3 theorems, 38 equations, 11 figures, 14 tables, 3 algorithms)

Introduction
Method Overview and Novelty
Relation to other work
Notation and Organization
Background
Kernel methods
Random Fourier Features
Automatic Relevance Determination (ARD) kernels
Overview of RFFNet
Random Fourier features for product ARD kernels
The objective function and training
Thresholding for variable selection
Results
Simulations
Variable selection
...and 16 more sections

Key Result

Proposition 1

Let $k_\theta: {\mathcal{X}} \times {\mathcal{X}} \to {\mathbb R}$ be an ARD kernel. Let $\bm{z}: {\mathbb R}^d \to {\mathbb R}^s$, $s \ge 1$, be the random Fourier features map for $k_{\mathbf{1}_d}$, the isotropic version of $k_\theta$. Then $\bm{z}(\theta \circ \bm{x})^\top \bm{z}(\theta \circ \bm{x}')$, where $\circ$ denotes the elementwise product, is an unbiased estimator of $k_\theta(\bm{x}, \bm{x}')$.
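A quick Monte Carlo check can illustrate the proposition. The sketch below reads the estimator as $\bm{z}(\theta \circ \bm{x})^\top \bm{z}(\theta \circ \bm{x}')$ and assumes the Gaussian ARD kernel $k_\theta(\bm{x}, \bm{x}') = \exp(-\|\theta \circ (\bm{x} - \bm{x}')\|^2 / 2)$, with the standard cosine features $z_j(\bm{x}) = \sqrt{2/s}\,\cos(\boldsymbol{\omega}_j^\top \bm{x} + b_j)$, $\boldsymbol{\omega}_j \sim \mathcal{N}(0, I_d)$, $b_j \sim \mathrm{U}[0, 2\pi]$. The inputs and relevances are arbitrary illustrative values.

```python
import math
import random


def sample_rff(d, s, rng):
    """Draw frequencies and phases for the isotropic kernel exp(-||u||^2 / 2)."""
    omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(s)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(s)]
    return omegas, phases


def rff_map(x, omegas, phases):
    """z(x) = sqrt(2/s) * cos(omega_j . x + b_j), j = 1..s."""
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(sum(wk * xk for wk, xk in zip(w, x)) + b)
            for w, b in zip(omegas, phases)]


def ard_gaussian_kernel(x, y, theta):
    """Exact ARD kernel: exp(-||theta * (x - y)||^2 / 2)."""
    return math.exp(-0.5 * sum((t * (a - b)) ** 2
                               for t, a, b in zip(theta, x, y)))


def ard_rff_estimate(x, y, theta, omegas, phases):
    """Isotropic feature map applied to the relevance-scaled inputs."""
    zx = rff_map([t * a for t, a in zip(theta, x)], omegas, phases)
    zy = rff_map([t * b for t, b in zip(theta, y)], omegas, phases)
    return sum(p * q for p, q in zip(zx, zy))


rng = random.Random(0)
theta = [1.0, 0.5, 0.0]          # third feature has zero relevance
x = [0.3, -1.2, 4.0]
y = [-0.4, 0.1, -7.0]
omegas, phases = sample_rff(d=3, s=5000, rng=rng)

exact = ard_gaussian_kernel(x, y, theta)
approx = ard_rff_estimate(x, y, theta, omegas, phases)

# Changing a zero-relevance coordinate leaves the estimate untouched.
x_moved = [0.3, -1.2, 100.0]
approx_moved = ard_rff_estimate(x_moved, y, theta, omegas, phases)
```

With a few thousand features, `approx` should land close to `exact`, and `approx_moved` matches `approx` because the third coordinate is multiplied by zero before it reaches the feature map. That is the mechanism by which a zero relevance removes a feature's influence.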

Figures (11)

Figure 1: RFFNet combines fast approximations of ARD kernels with carefully initialized, efficient stochastic gradient descent methods. The core part of RFFNet is implemented as a single PyTorch layer and can be seamlessly connected to other PyTorch-based models, making RFFNet highly modular.
Figure 2: Relevances for a realization of the gse1, gse2, jse2 and jse3 simulated datasets. The labels on the x-axis indicate the active features for each dataset (see the appendix on datasets). The relevances output by RFFNet peak exactly on the active features of these datasets.
Figure 3: The empirical False/True Discovery Rates (FDR/TDRs) for the simulated datasets, gse1 and gse2. Increasing the sample size controls the FDR and improves the identification of active features.
Figure 4: Relevance patterns for the amazon (top) and higgs (bottom) datasets. RFFNet assigns greater relevance to features that are intuitively relevant (for the amazon dataset) or scientifically relevant (for the higgs dataset).
Figure 5: Validation squared error loss as a function of the training epoch for the sampling strategies. Using the Gaussian sampling scheme led to faster convergence in all cases.
...and 6 more figures

Theorems & Definitions (8)

Definition 1
Definition 2: Product ARD kernel
Proposition 1
Proposition 2
Lemma 1: Relation between spectral densities of $k_\theta$ and $k_{\mathbf{1}_d}$
Proof
Proof of Proposition 1
Proof of Proposition 2
