Deep Sketched Output Kernel Regression for Structured Prediction
Tamim El Ahmad, Junjie Yang, Pierre Laforgue, Florence d'Alché-Buc
TL;DR
The paper tackles structured-output prediction by combining kernel-induced losses with deep neural networks. It introduces Deep Sketched Output Kernel Regression (DSOKR), which restricts predictions in the infinite-dimensional output feature space to a data-dependent finite-dimensional subspace obtained by sketching the empirical output covariance operator. The network's last layer is fixed to the resulting basis of the output RKHS, so the rest of the network can be trained by gradient descent. Learning proceeds in two steps: first estimate the eigenbasis with sketched kernel PCA (KPCA), then train the input network with the last layer fixed. At inference, a pre-image decoding step maps the prediction back to a structured output. DSOKR is evaluated on synthetic least-squares regression and on real-world graph prediction tasks, including SMILES-to-molecule and ChEBI-20, where it is reported to match or surpass baselines while reducing the last-layer parameter count. The approach offers a scalable route to training deep networks with kernel-induced losses on complex structured outputs, with potential applications in molecular design and graph prediction.
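The two learning steps and the decoding step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: a Gaussian output kernel on vector-valued outputs, a uniform sub-sampling sketch, a linear least-squares model standing in for the neural network, and decoding over the training outputs as the candidate set. All function names (`gaussian_gram`, `sketched_kpca_basis`, `embed`, `decode`) are hypothetical.

```python
import numpy as np

def gaussian_gram(A, B, gamma):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def sketched_kpca_basis(Y, m, gamma, rng, rtol=1e-10):
    """Step 1: basis of a sketched output subspace.

    Returns the sampled indices, the anchor outputs, and a matrix T such that
    the coordinates of psi(y) in the basis are z(y) = T @ k(anchors, y).
    """
    n = Y.shape[0]
    idx = rng.choice(n, size=m, replace=False)   # sub-sampling sketch (assumption)
    anchors = Y[idx]
    G = gaussian_gram(anchors, anchors, gamma)   # sketched Gram matrix R K R^T
    lam, V = np.linalg.eigh(G)
    keep = lam > rtol * lam.max()                # drop numerically null directions
    A = V[:, keep] / np.sqrt(lam[keep])          # orthonormalizes the sketched span
    Z0 = A.T @ gaussian_gram(anchors, Y, gamma)  # coordinates of the psi(y_i), (p, n)
    # KPCA inside the subspace: diagonalize the projected empirical covariance.
    s, U = np.linalg.eigh(Z0 @ Z0.T / n)
    order = np.argsort(s)[::-1]                  # leading eigen-directions first
    T = U[:, order].T @ A.T
    return idx, anchors, T

def embed(Y_new, anchors, T, gamma):
    """Coordinates z(y) of the projected output features, shape (p, n_new)."""
    return T @ gaussian_gram(anchors, Y_new, gamma)

def decode(h, Z_cand):
    """Pre-image over a finite candidate set.

    For a normalized kernel (k(c, c) = 1), minimizing ||g - psi(c)||^2 over
    candidates is the same as maximizing <h, z(c)>.
    """
    return int(np.argmax(Z_cand.T @ h))

# Toy data: vector outputs stand in for structured objects (assumption).
rng = np.random.default_rng(0)
n, d_in, d_out, m, gamma = 200, 8, 5, 30, 0.5
X = rng.normal(size=(n, d_in))
W_true = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)
Y = X @ W_true + 0.05 * rng.normal(size=(n, d_out))

# Step 1: estimate the output basis and embed the training outputs.
idx, anchors, T = sketched_kpca_basis(Y, m, gamma, rng)
Z = embed(Y, anchors, T, gamma)                  # regression targets, (p, n)

# Step 2: fit the input model on the coordinates with a squared loss.
# A linear least-squares fit stands in for gradient training of a network.
W_hat, *_ = np.linalg.lstsq(X, Z.T, rcond=None)
H = (X @ W_hat).T                                # predicted coordinates, (p, n)

# Inference: decode each prediction against the training outputs.
preds = [decode(H[:, i], Z) for i in range(n)]
```

In this sketch the last layer is never trained: the map from coordinates back to the output feature space is fixed by `T` and the anchors, so only the input model's parameters are fitted. Replacing the least-squares fit with any differentiable network trained on the same squared loss would leave steps 1 and 3 unchanged.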
Abstract
By leveraging the kernel trick in the output space, kernel-induced losses provide a principled way to define structured output prediction tasks for a wide variety of output modalities. In particular, they have been successfully used in the context of surrogate non-parametric regression, where the kernel trick is typically exploited in the input space as well. However, when inputs are images or texts, more expressive models such as deep neural networks seem more suited than non-parametric methods. In this work, we tackle the question of how to train neural networks to solve structured output prediction tasks, while still benefiting from the versatility and relevance of kernel-induced losses. We design a novel family of deep neural architectures, whose last layer predicts in a data-dependent finite-dimensional subspace of the infinite-dimensional output feature space deriving from the kernel-induced loss. This subspace is chosen as the span of the eigenfunctions of a randomly-approximated version of the empirical kernel covariance operator. Interestingly, this approach unlocks the use of gradient descent algorithms (and consequently of any neural architecture) for structured prediction. Experiments on synthetic tasks as well as real-world supervised graph prediction problems show the relevance of our method.
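To make the reduction to a finite-dimensional problem concrete, the relations below spell out a kernel-induced loss and what happens when the prediction is restricted to a subspace. The notation is assumed for illustration and is not taken from the abstract: $k$ is the output kernel with feature map $\psi$ into the RKHS $\mathcal{H}$, $(e_j)_{j \le p}$ is an orthonormal basis of the chosen subspace, $P$ is the orthogonal projection onto it, and $h_\theta$ is the network that outputs the $p$ coordinates.

```latex
% Kernel-induced loss: squared distance between output features,
% computable from kernel evaluations alone (kernel trick).
\Delta(y, y') = \|\psi(y) - \psi(y')\|_{\mathcal{H}}^2
              = k(y, y) - 2\,k(y, y') + k(y', y')

% Prediction restricted to span(e_1, ..., e_p), and coordinates of an output.
g_\theta(x) = \sum_{j=1}^{p} h_\theta(x)_j \, e_j,
\qquad
z(y)_j = \langle e_j, \psi(y) \rangle_{\mathcal{H}}

% Since g_\theta(x) lies in the subspace, the Pythagorean identity gives
\|g_\theta(x) - \psi(y)\|_{\mathcal{H}}^2
  = \|h_\theta(x) - z(y)\|_2^2 + \|\psi(y) - P\psi(y)\|_{\mathcal{H}}^2
```

The second term on the last line does not depend on $\theta$, so under these assumptions training reduces to an ordinary squared loss between $h_\theta(x)$ and the $p$-dimensional targets $z(y)$, which is what makes gradient descent applicable.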
