Deep Sketched Output Kernel Regression for Structured Prediction

Tamim El Ahmad, Junjie Yang, Pierre Laforgue, Florence d'Alché-Buc

TL;DR

The paper tackles structured-output prediction by integrating kernel-induced losses with deep neural networks. It introduces Deep Sketched Output Kernel Regression (DSOKR), which constrains infinite-dimensional output features to a data-dependent finite subspace obtained via sketching the empirical output covariance, enabling gradient-based learning of the input network while the last layer encodes outputs in a learned RKHS basis. Learning proceeds in two steps: first estimate the eigenbasis from sketched KPCA, then train the input network with the last layer fixed, followed by a pre-image decoding step at inference. DSOKR is demonstrated on synthetic least-squares regression and real-world graph-related tasks, including SMILES-to-molecule and ChEBI-20, where it consistently matches or surpasses baselines while reducing the last-layer parameter count. This approach provides a scalable path to end-to-end training with kernel losses for complex structured outputs, with practical impact on molecular design and graph prediction domains.
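The two learning steps and the decoding step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the sketch is a uniform sub-sampling of the training outputs (which makes the basis a Nyström-style approximation), the output kernel is Gaussian, a linear map trained by full-batch gradient descent stands in for the deep input network, and decoding searches a finite candidate set. All function names (`fit_output_basis`, `output_coeffs`, `train_linear`, `decode`) and the toy data are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # Pairwise k(a, b) = exp(-gamma * ||a - b||^2).
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_output_basis(Y, m, gamma=1.0, rng=None, tol=1e-10):
    # Step 1 (assumed sub-sampling sketch): eigendecompose the m x m Gram
    # matrix of m sampled outputs and keep numerically non-zero directions.
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(Y), size=m, replace=False)
    K_mm = gaussian_kernel(Y[idx], Y[idx], gamma)
    s, V = np.linalg.eigh(K_mm)
    keep = s > tol * s.max()
    # Column j of P defines e_j = sum_l P[l, j] * psi(anchor_l), with ||e_j|| = 1.
    return Y[idx], V[:, keep] / np.sqrt(s[keep])

def output_coeffs(Y, anchors, P, gamma=1.0):
    # Coordinates <psi(y), e_j> of each output in the learned basis.
    return gaussian_kernel(Y, anchors, gamma) @ P

def train_linear(X, Z, lr=0.1, steps=500):
    # Step 2: with the basis fixed, the kernel-induced loss reduces (up to a
    # constant) to a squared loss on basis coordinates, so plain gradient
    # descent applies. A linear map stands in for the input network here.
    W = np.zeros((X.shape[1], Z.shape[1]))
    losses = []
    for _ in range(steps):
        R = X @ W - Z
        losses.append((R**2).sum(1).mean())
        W -= lr * (2.0 / len(X)) * X.T @ R
    return W, losses

def decode(H, cand_coeffs):
    # Pre-image step over a finite candidate set. For a Gaussian kernel
    # k(y, y) = 1, so the nearest candidate maximizes the inner product.
    return (H @ cand_coeffs.T).argmax(1)

# Toy data (hypothetical): vector-valued outputs stand in for structured ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.tanh(X @ rng.normal(size=(5, 3)))

anchors, P = fit_output_basis(Y, m=20, gamma=0.5, rng=0)
Z = output_coeffs(Y, anchors, P, gamma=0.5)   # regression targets
W, losses = train_linear(X, Z)
pred_idx = decode(X @ W, Z)                   # index of decoded training output
```

Only the `m` sampled outputs and an `m x p` matrix need to be stored to define the last layer, which illustrates why the parameter count of that layer stays small. Replacing `train_linear` with any differentiable model trained on the same coordinate targets leaves the other steps unchanged.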

Abstract

By leveraging the kernel trick in the output space, kernel-induced losses provide a principled way to define structured output prediction tasks for a wide variety of output modalities. In particular, they have been successfully used in the context of surrogate non-parametric regression, where the kernel trick is typically exploited in the input space as well. However, when inputs are images or texts, more expressive models such as deep neural networks seem more suited than non-parametric methods. In this work, we tackle the question of how to train neural networks to solve structured output prediction tasks, while still benefiting from the versatility and relevance of kernel-induced losses. We design a novel family of deep neural architectures, whose last layer predicts in a data-dependent finite-dimensional subspace of the infinite-dimensional output feature space deriving from the kernel-induced loss. This subspace is chosen as the span of the eigenfunctions of a randomly-approximated version of the empirical kernel covariance operator. Interestingly, this approach unlocks the use of gradient descent algorithms (and consequently of any neural architecture) for structured prediction. Experiments on synthetic tasks as well as real-world supervised graph prediction problems show the relevance of our method.
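
To make the abstract concrete, a kernel-induced loss and its restriction to a finite-dimensional subspace can be written as follows. The notation is an assumption introduced for illustration and does not appear in this summary: $k$ is the output kernel with feature map $\psi$ into the RKHS $\mathcal{H}$, $\tilde{e}_1, \dots, \tilde{e}_p$ is an orthonormal family in $\mathcal{H}$, and $h(x) \in \mathbb{R}^p$ is the network output.

```latex
% Kernel-induced loss, computable from kernel evaluations only:
\Delta(y, y') = \|\psi(y) - \psi(y')\|_{\mathcal{H}}^2
             = k(y, y) - 2\,k(y, y') + k(y', y')

% Loss of a prediction constrained to span(\tilde{e}_1, ..., \tilde{e}_p),
% with c_j(y) = \langle \tilde{e}_j, \psi(y) \rangle_{\mathcal{H}}:
\Big\| \sum_{j=1}^{p} h_j(x)\, \tilde{e}_j - \psi(y) \Big\|_{\mathcal{H}}^2
  = \| h(x) - c(y) \|_2^2 + k(y, y) - \| c(y) \|_2^2
```

The last two terms do not depend on the network, so minimizing the loss over network parameters amounts to a finite-dimensional squared-loss regression onto the coordinates $c(y)$, which is what makes gradient descent applicable.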

Paper Structure

This paper contains 33 sections, 3 theorems, 25 equations, 7 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

[elahmad2024sketch] The eigenfunctions of the sketched empirical covariance operator $\widetilde{\mathrm{C}} = \mathrm{S}^{\#} \mathrm{R}^\top \mathrm{R} \mathrm{S}$ are the $\tilde{e}_j = \sqrt{n / \sigma \ldots}$ …

Figures (7)

  • Figure 1: Illustration of DSOKR model.
  • Figure 2: Sorted 400 highest ALS (left), validation MSE of Perfect h w.r.t. $m$ (center), and the difference between the test MSE of DSOKR and NN w.r.t. $m$ (right).
  • Figure 3: The GED w/ edge feature w.r.t. the sketching size $m$ for Perfect h for three graph kernels on SMI2Mol ($m > 6400$ is too costly computationally).
  • Figure 4: Predicted molecules on the SMI2Mol dataset.
  • Figure 5: The MRR scores on the ChEBI-20 validation set w.r.t. $m$ for Perfect h when the output kernel is Cosine or Gaussian.
  • ...and 2 more figures

Theorems & Definitions (12)

  • Remark 1: Input Neural net's last layers
  • Proposition 1
  • Remark 2: Random Fourier Features
  • Proposition 2
  • Remark 3: Beyond the square loss
  • Proposition 2
  • proof
  • Definition 1: Vertex Histogram kernel
  • Definition 2: Shortest-Path kernel [BorgwardtSP]
  • Definition 3: Neighborhood Subgraph Pairwise Distance kernel [costa_fast_2010]
  • ...and 2 more