Optimal Kernel Quantile Learning with Random Features

Caixing Wang, Xingdong Feng

TL;DR

This work develops a capacity-dependent statistical analysis for kernel quantile regression with random features (KQR-RF). Extending beyond the standard KRR-RF framework, it handles the non-smooth check loss and agnostic settings, where the target function may lie outside the RKHS, by introducing a refined error decomposition and adaptive self-calibration. The authors prove sharp learning rates, showing that with a suitable number of random features $M$ and regularization $\lambda$, the excess risk decays as ${\cal E}(f_{M,D,\lambda})-{\cal E}(f_{\tau}^*)=O(|D|^{-\frac{2r}{2r+\gamma}}\log^2|D|)$, and demonstrate that data-dependent sampling (leverage scores) reduces the number of features required while attaining the same rates. The results extend naturally to Lipschitz continuous losses and are supported by simulations and a real-data application, confirming both the theoretical and practical gains of the proposed approach.
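
For orientation, the check loss is commonly written as below, and plugging one parameter setting into the stated rate shows how the exponent behaves. The symbol $\rho_\tau$ is an assumed notation that does not appear in this summary, and the values $r=1/2$, $\gamma=1$ are chosen only as an illustration (they match one of the settings listed for Figure 2).

```latex
\begin{align*}
\rho_\tau(u) &= u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr), \qquad \tau \in (0,1), \\[4pt]
r = \tfrac{1}{2},\ \gamma = 1:\quad
\frac{2r}{2r+\gamma} &= \frac{1}{1+1} = \frac{1}{2}
\;\Longrightarrow\;
{\cal E}(f_{M,D,\lambda}) - {\cal E}(f_{\tau}^*)
  = O\bigl(|D|^{-1/2}\log^2 |D|\bigr).
\end{align*}
```

For fixed $r$, a smaller $\gamma$ pushes the exponent $\frac{2r}{2r+\gamma}$ toward 1, which gives a faster rate.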

Abstract

The random feature (RF) approach is a well-established and efficient tool for scalable kernel methods, but existing literature has primarily focused on kernel ridge regression with random features (KRR-RF), which has limitations in handling heterogeneous data with heavy-tailed noises. This paper presents a generalization study of kernel quantile regression with random features (KQR-RF), which accounts for the non-smoothness of the check loss in KQR-RF by introducing a refined error decomposition and establishing a novel connection between KQR-RF and KRR-RF. Our study establishes the capacity-dependent learning rates for KQR-RF under mild conditions on the number of RFs, which are minimax optimal up to some logarithmic factors. Importantly, our theoretical results, utilizing a data-dependent sampling strategy, can be extended to cover the agnostic setting where the target quantile function may not precisely align with the assumed kernel space. By slightly modifying our assumptions, the capacity-dependent error analysis can also be applied to cases with Lipschitz continuous losses, enabling broader applications in the machine learning community. To validate our theoretical findings, simulated experiments and a real data application are conducted.
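
To make the KQR-RF idea concrete, here is a minimal sketch of one way such an estimator could be fitted. It rests on several assumptions that are not taken from the paper: a Gaussian kernel approximated by random Fourier (cosine) features drawn by uniform, data-independent sampling, a squared-norm penalty on the coefficients, and plain subgradient descent as the solver. All function and parameter names (`check_loss`, `fit_kqr_rf`, `sigma`, `lr`, `steps`) are hypothetical.

```python
import numpy as np


def check_loss(u, tau):
    """Average check (pinball) loss: mean of u * (tau - 1{u < 0})."""
    return np.mean(u * (tau - (u < 0)))


def random_fourier_features(X, W, b):
    """Map X of shape (n, d) to M cosine features (Gaussian-kernel approximation)."""
    M = W.shape[1]
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)


def fit_kqr_rf(X, y, tau=0.5, M=100, lam=1e-3, sigma=1.0,
               steps=2000, lr=0.5, seed=0):
    """Fit quantile regression on random features by subgradient descent.

    Minimises mean check loss + lam * ||theta||^2 (an assumed objective).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Frequencies for a Gaussian kernel with bandwidth sigma (assumption).
    W = rng.normal(scale=1.0 / sigma, size=(d, M))
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    Phi = random_fourier_features(X, W, b)

    theta = np.zeros(M)
    for t in range(1, steps + 1):
        resid = y - Phi @ theta
        # Subgradient of the mean check loss plus gradient of the penalty.
        grad = -Phi.T @ (tau - (resid < 0)) / n + 2.0 * lam * theta
        theta -= (lr / np.sqrt(t)) * grad  # decaying step size
    return theta, W, b


def predict(X, theta, W, b):
    """Evaluate the fitted quantile function at new inputs."""
    return random_fourier_features(X, W, b) @ theta
```

Because the check loss is non-smooth, the sketch uses a subgradient with a decaying step size rather than a closed-form solve as in ridge regression. A linear-programming or other dedicated quantile solver would be a reasonable substitute.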

Paper Structure

This paper contains 35 sections, 28 theorems, 145 equations, 8 figures, 4 tables.

Key Result

Theorem 3.8

Assume there exists a function $f_{\cal H}$ such that $f_{\cal H}=\mathop{\rm argmin}_{f \in {\cal H}_K}{\cal E}(f)$. Under some technical assumptions (Assumption 3.3, Assumption 3.4 with $r=1/2$, an eigenvalue decay assumption stronger than Assumption 3.5, and a local strong convexity assumption), and provided $|D|$ is sufficiently large, the result holds with probability close to 1.

Figures (8)

  • Figure 1: Comparison between the number of random features $M={\cal O}(|D|^c)$ required for uniform sampling ($\alpha=1$, left) and leverage scores sampling ($\alpha=\gamma$, right). The two subfigures show the agnostic case and the realizable case, respectively.
  • Figure 2: Estimated and true quantile curves for $r=0,\gamma=1$ (left), $r=1/2, \gamma=1$ (middle), and $r=1, \gamma=0$ (right) when $\tau=0.5$.
  • Figure 3: Log empirical excess risk for $r=0.2,\gamma=0.1$ (left top), $r=0.4, \gamma=0.2$ (right top), $r=0.5, \gamma=0.1$ (left bottom) and $r=0.8, \gamma=0.2$ (right bottom) when $\tau=0.5$.
  • Figure 4: Averaged PQE and its standard deviation against the number of random features used in KQR-RF under various scenarios.
  • Figure 5: Averaged PQE and its standard deviation against the number of random features used in KQR-RF for different sampling strategies in the homoscedastic case.
  • ...and 3 more figures

Theorems & Definitions (53)

  • Definition 3.1: Integral operators
  • Definition 3.2: Effective dimension
  • Remark 3.7
  • Theorem 3.8: Existing learning rates for KQR-RF (random features with Lipschitz loss; Theorem 19 of Li et al., 2021)
  • Theorem 3.9: Worst case
  • Remark 3.10
  • Remark 3.11
  • Theorem 3.13
  • Example 3.14: Leverage scores sampling
  • Remark 3.15
  • ...and 43 more