Hybrid Random Features

Krzysztof Choromanski, Haoxian Chen, Han Lin, Yuanzhe Ma, Arijit Sehanobish, Deepali Jain, Michael S Ryoo, Jake Varley, Andy Zeng, Valerii Likhosherstov, Dmitry Kalashnikov, Vikas Sindhwani, Adrian Weller

TL;DR

This work introduces Hybrid Random Features (HRFs), which adaptively approximate softmax and Gaussian kernels with unbiased estimators and reduced worst-case relative error. By unifying trigonometric, positive, and complex-exponential random features under a Bochner-like theory, HRFs offer flexible constructions (including bipolar, lambda-angular, and Gaussian-mixture variants) that trade off accuracy against computational cost. The authors derive general MSE and variance expressions for many hybrids and give concrete estimator designs with explicit constructions (A-matrices, angular λs, cluster-based schemes). They validate HRFs on kernel estimation tasks and language-modeling benchmarks, showing improved accuracy and efficiency in linear-attention-like settings and related applications, including robotics-relevant tasks. Overall, HRFs offer a principled, scalable framework for tailoring kernel estimation to regions of interest while maintaining unbiasedness and favorable error properties.

Abstract

We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide the most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi and Recht, 2007) or (recently introduced in the context of linear-attention Transformers) positive random features (Choromanski et al., 2021). By generalizing Bochner's Theorem for softmax/Gaussian kernels and leveraging random features for compositional kernels, the HRF mechanism provides strong theoretical guarantees: unbiased approximation and strictly smaller worst-case relative errors than its counterparts. We conduct an exhaustive empirical evaluation of HRFs, ranging from pointwise kernel estimation experiments, through tests on data admitting clustering structure, to benchmarking implicit-attention Transformers (also for downstream Robotics applications), demonstrating their quality in a wide spectrum of machine learning problems.

Paper Structure

This paper contains 26 sections, 10 theorems, 190 equations, 10 figures, 3 tables.

Key Result

Lemma 1.3

For $\mathbf{x}, \mathbf{y} \in \mathbb{R}^{d}$, the following is true: $\mathrm{SM}(\mathbf{x}, \mathbf{y}) = \Lambda \, \mathbb{E}_{\omega \sim \mathcal{N}(0, \mathbf{I}_{d})}[\exp(\omega^{\top}\mathbf{z})]$ for $\mathbf{z}=\mathbf{x}+\mathbf{y}$ and $\Lambda = \exp(-\frac{\|\mathbf{x}\|^{2}+\|\mathbf{y}\|^{2}}{2})$. Consequently, the softmax kernel admits a positive random feature map decomposition with $l=1$ and $\xi_{1}(\mathbf{u}, \omega) = \exp(\omega^{\top}\mathbf{u} - \frac{\|\mathbf{u}\|^{2}}{2})$ ...
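The positive random feature map in the lemma above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation. It assumes the softmax kernel is $\exp(\mathbf{x}^{\top}\mathbf{y})$ and that $\omega$ is drawn from a standard Gaussian. The function names, the number of features, and the seed are hypothetical choices made for illustration.

```python
import numpy as np


def positive_features(u, omegas):
    """Feature vector with entries exp(omega_i^T u - ||u||^2 / 2) / sqrt(m).

    `omegas` has shape (m, d); each row is one Gaussian sample omega_i.
    The 1/sqrt(m) scaling turns the dot product of two feature vectors
    into a Monte Carlo average over the m samples.
    """
    m = omegas.shape[0]
    return np.exp(omegas @ u - 0.5 * np.dot(u, u)) / np.sqrt(m)


def estimate_softmax_kernel(x, y, num_features=100_000, seed=0):
    """Estimate exp(x^T y) with positive random features (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # Assumption: omega ~ N(0, I_d), shared between x and y.
    omegas = rng.standard_normal((num_features, x.shape[0]))
    return float(positive_features(x, omegas) @ positive_features(y, omegas))


x = np.array([0.3, -0.2, 0.1])
y = np.array([0.1, 0.4, -0.3])

exact = float(np.exp(x @ y))            # softmax kernel value
approx = estimate_softmax_kernel(x, y)  # random-feature estimate
relative_error = abs(approx - exact) / exact
```

Every feature is strictly positive because it is an exponential, so the estimate is strictly positive as well. The relative error shrinks as `num_features` grows, since the estimator is an average of independent unbiased samples.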

Figures (10)

  • Figure 1: Histogram of angles between keys and test query embeddings from our LSTM trained on Penn Tree Bank (Left) and WikiText2 (Right).
  • Figure 2: Cumulative distribution plot of softmax values, and histograms of key length and query length in the PTB dataset.
  • Figure 3: 1D Wasserstein distances for different random feature techniques estimating the softmax value matrix in the PTB dataset.
  • Figure 4: Kolmogorov–Smirnov distances for different random feature techniques estimating the softmax value matrix in the PTB dataset.
  • Figure 5: KL divergence for different random feature techniques estimating the softmax value matrix in the PTB dataset.
  • ...and 5 more figures

Theorems & Definitions (21)

  • Definition 1.1: Softmax and Gaussian kernel
  • Definition 1.2: Kernel with a Random Feature Map Representation
  • Lemma 1.3: Positive Random Features
  • Lemma 1.4: positive versus trigonometric random features
  • Lemma 1.5
  • Lemma 1.6
  • proof
  • Definition 1.7
  • Lemma 1.8
  • proof
  • ...and 11 more