Feature maps for the Laplacian kernel and its generalizations

Sudhendu Ahir, Parthe Pandit

TL;DR

This work addresses the efficient approximation of the non-separable Laplacian kernel and its generalizations (Matérn and exponential-power) by developing two scalable random-feature families, random Fourier features (RFF) and orthogonal random features (ORF), that accommodate an anisotropic covariance $M$ and heavy-tailed weight distributions. It derives explicit, implementable weight-sampling schemes for the Laplacian, Matérn, and exponential-power kernels and proves that the associated random-feature maps converge to the exact kernels as the feature count $p$ grows, even under anisotropy. The authors provide Fourier-transform-based weight constructions, accompanying sampling algorithms (including elliptically contoured $\alpha$-stable, multivariate $t$, and Cauchy distributions), and numerical validation on real datasets, showing speedups and improved calibration for kernel logistic regression. The results offer a practical route to scalable kernel-based learning with non-separable kernels in large-scale settings while retaining theoretical guarantees.

Abstract

Recent applications of kernel methods in machine learning have seen a renewed interest in the Laplacian kernel, due to its stability with respect to the bandwidth hyperparameter in comparison to the Gaussian kernel, as well as its expressivity being equivalent to that of the neural tangent kernel of deep fully connected networks. However, unlike the Gaussian kernel, the Laplacian kernel is not separable. This poses challenges for techniques that approximate it, especially the random Fourier features (RFF) methodology and its variants. In this work, we provide random features for the Laplacian kernel and two of its generalizations: the Matérn kernel and the exponential-power kernel. We provide efficiently implementable schemes to sample weight matrices so that random features approximate these kernels. These weight matrices have a weakly coupled heavy-tailed randomness. Via numerical experiments on real datasets, we demonstrate the efficacy of these random feature maps.
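
As a rough illustration of what sampling a weight matrix for a non-separable kernel can look like, the sketch below covers only the isotropic case. It relies on a standard fact rather than on the paper's own algorithm: the spectral distribution of a Matérn kernel with smoothness $\nu$ is a multivariate $t$ distribution with $2\nu$ degrees of freedom, and $\nu=1/2$ gives the Laplacian kernel with multivariate Cauchy weights. The function names, the `lengthscale` parameter, and the cosine/sine feature map are assumptions made for this example; the paper's nonlinearity $\psi_p$ and its anisotropic and ORF constructions are not reproduced here.

```python
import numpy as np

def sample_weights(p, d, nu, lengthscale, rng):
    """Draw p frequency vectors in R^d from an isotropic multivariate t
    distribution with 2*nu degrees of freedom and scale 1/lengthscale.
    (Hypothetical helper; standard Gaussian-over-chi construction.)"""
    g = rng.standard_normal((p, d))            # independent Gaussian entries
    u = rng.chisquare(2.0 * nu, size=(p, 1))   # one shared scale per row couples its entries
    return g / (lengthscale * np.sqrt(u / (2.0 * nu)))

def feature_map(X, W):
    """Classical cosine/sine random Fourier features (an assumption here);
    Phi @ Phi.T is a Monte Carlo estimate of the kernel matrix."""
    Z = X @ W.T
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(W.shape[0])

def exact_kernel(X, nu, lengthscale):
    """Closed forms for nu = 1/2 (Laplacian) and nu = 3/2 (Matern)."""
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.sqrt(2.0 * nu) * r / lengthscale
    if nu == 0.5:
        return np.exp(-s)
    if nu == 1.5:
        return (1.0 + s) * np.exp(-s)
    raise ValueError("only nu in {0.5, 1.5} is implemented in this sketch")

def max_abs_error(nu, p=20000, n=5, d=3, lengthscale=2.0, seed=0):
    """Largest entrywise gap between the approximate and exact kernel matrices."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    W = sample_weights(p, d, nu, lengthscale, rng)
    Phi = feature_map(X, W)
    return float(np.max(np.abs(Phi @ Phi.T - exact_kernel(X, nu, lengthscale))))
```

Calling `max_abs_error(0.5)` compares the random-feature estimate against the exact Laplacian kernel on a small synthetic sample; being a Monte Carlo average, the gap should shrink roughly like $1/\sqrt{p}$. The per-row chi-square scale is what makes the entries of each weight vector dependent and heavy-tailed, which is one plausible reading of the "weakly coupled heavy-tailed randomness" mentioned above.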

Paper Structure

This paper contains 23 sections, 12 theorems, 31 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Let $\left\{w_i\right\}_{i=1}^p$ be i.i.d. samples from a distribution over $\mathbb{R}^d$ whose characteristic function is $\frac{1}{c_\kappa}\kappa$. Let $\psi_p$ be an elementwise nonlinearity defined in eq:nonlinearity, and suppose the rows of $W\in\mathbb{R}^{p\times d}$ are $w_i$. Then

Figures (6)

  • Figure 1: Trade-off between computational cost and approximation error in evaluating the Matérn $\nu=4$ kernel across datasets with increasing dimensions and increasing $p$, for $n=10,000$ samples. $\Phi\in\mathbb{R}^{n\times p}$ is the matrix of random features.
  • Figure 2: Evaluation of ORF. (Left): relative approximation error for the kernel matrix in Frobenius norm across various datasets, for $n=10,000$ samples. Errors measured in other norms (operator, nuclear) are reported in \ref{appendix:expts} (see Figures \ref{fig:approx_op_norm}, \ref{fig:approx_nuc_norm}). Here, $\Phi\in\mathbb{R}^{n\times p}$ is the matrix of random features. (Right): performance of the random-features predictor in comparison to the exact solution via EigenPro2 [ma2019kernel]. The unnormalized numbers are available in \ref{tab:krr}. See \ref{fig:RFF_eval} for a similar evaluation of RFF.
  • Figure 3: Evaluation of RFF. (Left): relative approximation error for the kernel matrix in Frobenius norm across various datasets, for $n=10,000$ samples. Errors measured in other norms (operator, nuclear) are reported in Figures \ref{fig:approx_op_norm}, \ref{fig:approx_nuc_norm}. (Right): performance of the random-features predictor in comparison to the exact solution via EigenPro2 [ma2019kernel].
  • Figure 4: Speedup in kernel computation, along with relative Frobenius error, for $n=10,000$ samples. (1): for the AQI dataset $(d=12)$, (2): for the FMNIST dataset $(d=784)$.
  • Figure 5: Relative approximation error (operator norm) for the kernel matrix, for $n=10,000$ samples. (Left): $\Phi$ is computed using the RFF sampling. (Right): $\Phi$ is computed using the ORF sampling.
  • ...and 1 more figure

Theorems & Definitions (26)

  • Remark 1: Generalizations of Laplacian kernels
  • Definition 1: Mahalanobis norm
  • Definition 2: $\chi(k)$ and $\chi^2(k)$ distributions
  • Definition 3: BetaPrime$(\alpha,\beta)$ distribution
  • Definition 4: Generalized Beta Prime (GBP) distribution
  • Definition 5: Multivariate $t_{2\nu}$-distribution
  • Definition 6: Univariate stable distribution [nolan2020univariate]
  • Definition 7: Multivariate $\alpha$-stable distribution
  • Definition 8
  • Definition 9: Matérn kernel
  • ...and 16 more