Sketch In, Sketch Out: Accelerating both Learning and Inference for Structured Prediction with Kernels

Tamim El Ahmad, Luc Brogat-Motte, Pierre Laforgue, Florence d'Alché-Buc

TL;DR

This work tackles the scalability gap of surrogate kernel methods for structured prediction by introducing SISOKR, which applies random projections to both input and output feature maps. It provides excess-risk bounds that decompose into non-sketched regression error plus input/output sketch reconstruction errors, with learning rates that improve when using sub-Gaussian sketches. Theoretical results demonstrate that small sketch sizes, informed by eigendecay, suffice to retain near-optimal rates, while experiments show substantial reductions in training and inference time with competitive accuracy on real-world datasets. Overall, the approach enables scalable, provably sound structured prediction with kernelized outputs, broadening the applicability of kernel surrogate methods to large-scale problems.

Abstract

Leveraging the kernel trick in both the input and output spaces, surrogate kernel methods are a flexible and theoretically grounded solution to structured output prediction. Although they provide state-of-the-art performance on complex data sets of moderate size (e.g., in chemoinformatics), these approaches fail to scale. We propose to equip surrogate kernel methods with sketching-based approximations, applied to both the input and output feature maps. We prove excess risk bounds on the original structured prediction problem, showing how to attain close-to-optimal rates with a reduced sketch size that depends on the eigendecay of the input/output covariance operators. From a computational perspective, we show that the two approximations have distinct but complementary impacts: sketching the input kernel mostly reduces training time, while sketching the output kernel decreases the inference time. Empirically, our approach is shown to scale, achieving state-of-the-art performance on benchmark data sets where non-sketched methods are intractable.
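
To make the training-time claim concrete, here is a minimal sketch of generic sketched kernel ridge regression with vector-valued targets. It is not the SISOKR estimator itself: the Gaussian sketch, the Gaussian kernel, the toy data, and every function name below are assumptions chosen for illustration. The point it shows is that, once the input Gram matrix is sketched, the linear system to solve has size m x m instead of n x n.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # Pairwise Gaussian kernel exp(-gamma * ||a - b||^2).
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def gaussian_sketch(m, n, rng):
    # Assumed sketch: i.i.d. N(0, 1/m) entries (one sub-Gaussian choice).
    return rng.standard_normal((m, n)) / np.sqrt(m)

def fit_sketched_krr(K, Y, R, lam):
    # Minimise (1/n) ||Y - K R^T C||_F^2 + lam * tr(C^T R K R^T C) over C.
    # Normal equations: (R K^2 R^T + n*lam * R K R^T) C = R K Y, an m x m system.
    n = K.shape[0]
    KRt = K @ R.T                               # n x m
    A = KRt.T @ KRt + n * lam * (R @ KRt)       # m x m
    b = KRt.T @ Y                               # m x d_out
    return np.linalg.lstsq(A, b, rcond=None)[0]

def predict_sketched_krr(K_new_train, R, coef):
    # Predictions for rows of a (n_new x n) cross-kernel matrix.
    return K_new_train @ R.T @ coef

rng = np.random.default_rng(0)
n, d_in, d_out, m = 200, 3, 2, 20
X = rng.standard_normal((n, d_in))
Y = np.stack([np.sin(X[:, 0]), np.cos(X[:, 1])], axis=1)

K = gaussian_kernel(X, X)
R = gaussian_sketch(m, n, rng)
coef = fit_sketched_krr(K, Y, R, lam=1e-3)
Y_hat = predict_sketched_krr(K, R, coef)
train_mse = float(np.mean((Y_hat - Y) ** 2))
```

Forming the sketched matrices still touches the full Gram matrix in this toy version; the saving illustrated here is only in the size of the system being solved.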

Paper Structure (47 sections, 24 theorems, 104 equations, 4 figures, 5 tables)

This paper contains 47 sections, 24 theorems, 104 equations, 4 figures, 5 tables.

Key Result

Proposition 1

$\forall\,x \in \mathcal{X}$, where $\tilde{\alpha}\left(x\right) = \mathop{\mathrm{R_\mathcal{Y}}}\nolimits^\top \widetilde{\Omega} \mathop{\mathrm{R_\mathcal{X}}}\nolimits \mathop{\mathrm{k_X^x}}\nolimits$ and with $\mathop{\mathrm{\widetilde{K}_X}}\nolimits = \mathop{\mathrm{R_\mathcal{X}}}\nolimits \mathop{\mathrm{K_X}}\nolimits \mathop{\mathrm{R_\mathcal{X}}}\nolimits^\top$ and …
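
The shapes in the expressions for $\tilde{\alpha}(x)$ and $\widetilde{K}_X$ can be checked with a small NumPy sketch. Everything numeric below is a hypothetical stand-in: the sketch matrices are Gaussian, the Gram matrix is synthetic, and `Omega_tilde` is a random placeholder of the right shape, since its definition is not available in this excerpt. The vector `v` is likewise an assumed stand-in for a vector of output-kernel evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m_x, m_y = 500, 30, 20

# Hypothetical stand-ins (not the paper's data or definitions).
R_x = rng.standard_normal((m_x, n)) / np.sqrt(m_x)   # input sketch, m_x x n
R_y = rng.standard_normal((m_y, n)) / np.sqrt(m_y)   # output sketch, m_y x n
F = rng.standard_normal((n, 5))
K_x = F @ F.T                                        # synthetic PSD input Gram matrix
Omega_tilde = rng.standard_normal((m_y, m_x))        # placeholder only

# Sketched input Gram matrix: K~_X = R_X K_X R_X^T, of size m_x x m_x.
K_x_tilde = R_x @ K_x @ R_x.T

# alpha~(x) = R_Y^T Omega~ R_X k_X^x, evaluated right to left so that
# only vectors of size m_x and m_y are handled before the last step.
k_x_vec = K_x[:, 0]          # kernel evaluations between x and the training inputs
z = R_x @ k_x_vec            # shape (m_x,)
w = Omega_tilde @ z          # shape (m_y,)
alpha = R_y.T @ w            # shape (n,)

# By associativity, an inner product <alpha~(x), v> can be computed in the
# sketched output space without ever forming the n-dimensional alpha~(x).
v = rng.standard_normal(n)
score_full = alpha @ v
score_sketched = w @ (R_y @ v)
```

The last two lines are plain associativity of matrix products; whether decoding in the paper takes exactly this inner-product form is not stated in this excerpt.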

Figures (4)

  • Figure 1: IOKR (left) and SISOKR (right) in the KDE setting. Note that SISOKR amounts to IOKR with the kernels $\mathop{\mathrm{k_\mathcal{Z}}}\nolimits$ replaced by their projected versions $\tilde{k}_\mathcal{Z}(\cdot, \cdot) = \langle \mathop{\mathrm{\psi_\mathcal{Z}}}\nolimits(\cdot), \mathop{\mathrm{\widetilde{P}_Z}}\nolimits \mathop{\mathrm{\psi_\mathcal{Z}}}\nolimits(\cdot) \rangle_{\mathop{\mathrm{\mathcal{H}_\mathcal{Z}}}\nolimits}$. However, this new output kernel changes the pre-image problem, and consequently the estimator $\tilde{f}$. In the paper, we modify $\widetilde{H}$ (and not the kernels) in order to use the comparison inequality from Ciliberto et al. (2020); see the proof of Corollary 1.
  • Figure 2: Variation of training and inference time w.r.t. $\mathop{\mathrm{m_\mathcal{X}}}\nolimits$ and $\mathop{\mathrm{m_\mathcal{Y}}}\nolimits$ (left and center), and trade-off performance against computational time (right) for SISOKR with $(2 \cdot 10^{-3})$-SR input/output sketches on synthetic data.
  • Figure 3: Test MSE with respect to $\mathop{\mathrm{m_\mathcal{X}}}\nolimits$ and $\mathop{\mathrm{m_\mathcal{Y}}}\nolimits$ for the SISOKR model with $(2 \cdot 10^{-3})$-SR input and output sketches.
  • Figure 4: Test MSE with respect to $\mathop{\mathrm{m_\mathcal{X}}}\nolimits$ and $\mathop{\mathrm{m_\mathcal{Y}}}\nolimits$ for the SIOKR and ISOKR models, respectively, with $(2 \cdot 10^{-3})$-SR input and output sketches.
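
The captions above refer to $(2 \cdot 10^{-3})$-SR sketches without expanding the acronym. Assuming "SR" denotes a p-sparsified Rademacher sketch with p = 2e-3 (an assumption, as this excerpt does not define it), one common construction keeps each entry with probability p, assigns it a random sign, and rescales. The function name and the 1/sqrt(m p) scaling below are illustrative choices, picked so that the sketch satisfies E[R^T R] = I.

```python
import numpy as np

def sparsified_rademacher_sketch(m, n, p, rng):
    # Each entry is nonzero with probability p (Bernoulli mask) and, when
    # nonzero, equals +/- 1/sqrt(m*p) with equal probability.
    # Then E[R_ij^2] = 1/m, so E[R^T R] is the n x n identity.
    mask = rng.random((m, n)) < p
    signs = rng.choice([-1.0, 1.0], size=(m, n))
    return (mask * signs) / np.sqrt(m * p)

rng = np.random.default_rng(0)
n, m, p = 5000, 50, 2e-3
R = sparsified_rademacher_sketch(m, n, p, rng)

# Fraction of nonzero entries; close to p on average.
density = np.count_nonzero(R) / R.size
```

With such a small p, most entries are zero, so storing R in a sparse format (for instance with `scipy.sparse`) would make products like R K cheaper than with a dense sketch; the dense array here is only for readability.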

Theorems & Definitions (52)

  • Proposition 1: Expression of SISOKR
  • Theorem 1: SISOKR excess-risk bound
  • Proof: Proof sketch
  • Definition 1
  • Theorem 2: sub-Gaussian sketching reconstruction error
  • Proof: Proof sketch
  • Remark 1: Comparison to Nyström's approximation
  • Remark 2: Relaxation of \ref{['asm:emb']}
  • Corollary 1: SISOKR learning rates
  • Proof
  • ...and 42 more