RFFNet: Large-Scale Interpretable Kernel Methods via Random Fourier Features

Mateus P. Otto, Rafael Izbicki

TL;DR

RFFNet tackles the dual challenge of scaling kernel methods and maintaining interpretability by marrying random Fourier features with product ARD kernels. It jointly learns per-feature relevances $\boldsymbol{\theta}$ and predictive parameters $\boldsymbol{\beta}$ via stochastic optimization, using fixed random features to keep memory and computation low. The method yields interpretable feature importances while delivering competitive predictive accuracy across simulations and real-world data, and includes a thresholding procedure (TopK) for reliable variable selection. A PyTorch-based implementation alongside ablation studies supports practical adoption and reproducibility in large-scale settings.

Abstract

Kernel methods provide a flexible and theoretically grounded approach to nonlinear and nonparametric learning. While memory and run-time requirements hinder their applicability to large datasets, many low-rank kernel approximations, such as random Fourier features, were recently developed to scale up such kernel methods. However, these scalable approaches are based on approximations of isotropic kernels, which cannot remove the influence of irrelevant features. In this work, we design random Fourier features for a family of automatic relevance determination (ARD) kernels, and introduce RFFNet, a new large-scale kernel method that learns the kernel relevances on the fly via first-order stochastic optimization. We present an effective initialization scheme for the method's non-convex objective function, evaluate whether hard-thresholding RFFNet's learned relevances yields a sensible rule for variable selection, and perform an extensive ablation study of RFFNet's components. Numerical validation on simulated and real-world data shows that our approach has a small memory footprint and run-time, achieves low prediction error, and effectively identifies relevant features, thus leading to more interpretable solutions. We supply users with an efficient, PyTorch-based library that adheres to the scikit-learn standard API, along with code for fully reproducing our results.
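To make the idea of hard-thresholding learned relevances concrete, here is a minimal sketch. It assumes the simplest possible rule: keep the k features whose relevances are largest in absolute value. The function names `top_k_features` and `hard_threshold`, the choice of k, and the example relevance values are all hypothetical, and the paper's TopK procedure may differ in its details.

```python
def top_k_features(relevances, k):
    """Return sorted indices of the k features with the largest |relevance|."""
    if not 0 <= k <= len(relevances):
        raise ValueError("k must be between 0 and the number of features")
    # Rank feature indices by absolute relevance, largest first.
    ranked = sorted(range(len(relevances)),
                    key=lambda j: abs(relevances[j]),
                    reverse=True)
    return sorted(ranked[:k])


def hard_threshold(relevances, k):
    """Zero out every relevance outside the top k; keep the others unchanged."""
    keep = set(top_k_features(relevances, k))
    return [r if j in keep else 0.0 for j, r in enumerate(relevances)]


# Made-up relevance values, for illustration only (not from the paper).
theta_hat = [0.91, -0.02, 0.45, 0.003, -0.60]
selected = top_k_features(theta_hat, k=3)
thresholded = hard_threshold(theta_hat, k=3)
print(selected)
print(thresholded)
```

Ranking by absolute value reflects the assumption that a relevance's sign does not matter when it only rescales an input coordinate inside the kernel.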

Paper Structure (31 sections, 3 theorems, 38 equations, 11 figures, 14 tables, 3 algorithms)


Key Result

Proposition 1

Let $k_\theta: {\mathcal{X}} \times {\mathcal{X}} \to {\mathbb R}$ be an ARD kernel. Let $\bm{z}: {\mathbb R}^d \to {\mathbb R}^s$, $s \ge 1$, be the random Fourier features map for $k_{\mathbf{1}_d}$, the isotropic version of $k_\theta$. Then $\bm{z}(\theta \circ \bm{x})^\top \bm{z}(\theta \circ \bm{x}')$, where $\circ$ denotes the element-wise product, is an unbiased estimator of $k_\theta(\bm{x}, \bm{x}')$.
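The proposition can be checked numerically. The sketch below makes several assumptions that are not stated in this text: the ARD kernel is Gaussian, $k_\theta(\bm{x}, \bm{x}') = \exp(-\lVert \theta \circ (\bm{x} - \bm{x}') \rVert^2 / 2)$; the estimator has the form $\bm{z}(\theta \circ \bm{x})^\top \bm{z}(\theta \circ \bm{x}')$; and $\bm{z}$ is the standard cosine random Fourier features map with standard normal frequencies and uniform phases. All function names and input values are hypothetical, and the code uses only the Python standard library rather than the authors' PyTorch implementation.

```python
import math
import random


def sample_rff_params(d, s, rng):
    """Draw frequencies and phases for the isotropic Gaussian kernel exp(-||u||^2 / 2)."""
    omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(s)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(s)]
    return omegas, phases


def rff_map(x, omegas, phases):
    """Random Fourier features z(x) with entries sqrt(2/s) * cos(omega^T x + b)."""
    scale = math.sqrt(2.0 / len(omegas))
    return [
        scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
        for w, b in zip(omegas, phases)
    ]


def ard_gaussian_kernel(x, y, theta):
    """Exact ARD Gaussian kernel exp(-||theta * (x - y)||^2 / 2)."""
    sq = sum((t * (a - b)) ** 2 for t, a, b in zip(theta, x, y))
    return math.exp(-0.5 * sq)


def ard_rff_estimate(x, y, theta, omegas, phases):
    """Estimate k_theta(x, y) by rescaling inputs with theta before the fixed map z."""
    zx = rff_map([t * a for t, a in zip(theta, x)], omegas, phases)
    zy = rff_map([t * b for t, b in zip(theta, y)], omegas, phases)
    return sum(u * v for u, v in zip(zx, zy))


rng = random.Random(0)
theta = [1.0, 0.5, 0.0]   # the third feature is given zero relevance
x = [0.3, -1.2, 5.0]
y = [-0.4, 0.1, -7.0]     # differs strongly in the zero-relevance coordinate
omegas, phases = sample_rff_params(d=3, s=20000, rng=rng)

exact = ard_gaussian_kernel(x, y, theta)
approx = ard_rff_estimate(x, y, theta, omegas, phases)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The frequencies and phases are sampled once and never depend on `theta`; only the inputs are rescaled. This is what allows the random features to stay fixed while the relevances are learned, and it shows why a zero relevance removes a feature's influence on the kernel entirely.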

Figures (11)

  • Figure 1: RFFNet combines fast approximations of ARD kernels with carefully initialized, efficient stochastic gradient descent methods. The core part of RFFNet is implemented as a single PyTorch layer and can be seamlessly connected to other PyTorch-based models, making RFFNet highly modular.
  • Figure 2: Relevances for a realization of the gse1, gse2, jse2 and jse3 simulated datasets. The labels on the x-axis indicate the active features for each dataset (see the appendix on datasets). The relevances output by RFFNet peak exactly on the active features of these datasets.
  • Figure 3: The empirical False/True Discovery Rates (FDR/TDRs) for the simulated datasets, gse1 and gse2. Increasing the sample size controls the FDR and improves the identification of active features.
  • Figure 4: Relevance patterns for the amazon (top) and higgs (bottom) datasets. RFFNet associates greater relevance to features that are intuitively (for the amazon dataset) and scientifically relevant (for the higgs dataset).
  • Figure 5: Validation squared error loss as a function of the training epoch for the sampling strategies. Using the Gaussian sampling scheme led to faster convergence in all cases.
  • ...and 6 more figures

Theorems & Definitions (8)

  • Definition 1
  • Definition 2: Product ARD kernel
  • Proposition 1
  • Proposition 2
  • Lemma 1: Relation between spectral densities of $k_\theta$ and $k_{\mathbf{1}_d}$
  • Proof
  • Proof of Proposition 1
  • Proof of Proposition 2