A Compositional Kernel Model for Feature Learning

Feng Ruan, Keli Liu, Michael Jordan

TL;DR

This work analyzes a compositional variant of kernel ridge regression in which the predictor is applied to a coordinate-wise reweighting of the inputs, formalized as $\min_{\beta} \min_{f} \mathbb{E}[(Y - f(\beta \circ X))^2] + \lambda \|f\|_H^2$. Working at the population level, it establishes that global minimizers and directional stationary points can discard noise coordinates when the noise variables are Gaussian, and that nonsmooth $\ell_1$-type kernels such as the Laplace kernel enable recovery of nonlinear main effects, whereas Gaussian kernels primarily detect linear effects. The authors develop a first-variation framework in translation-invariant RKHSs, provide integral representations and error controls for the directional derivatives, and study feature recovery via a core sufficient feature set under a functional ANOVA model. These results show how kernel choice and optimization geometry govern feature learning and variable selection in compositional kernel architectures. The findings single out the Laplace kernel as particularly suited to sparse, interpretable feature recovery and to eliminating irrelevant coordinates, and they offer a theoretical foundation for feature learning in nonconvex, nonsmooth settings. In practice, the results can guide the choice of kernel and optimization strategy for feature discovery in high-dimensional, noisy data.

Abstract

We study a compositional variant of kernel ridge regression in which the predictor is applied to a coordinate-wise reweighting of the inputs. Formulated as a variational problem, this model provides a simple testbed for feature learning in compositional architectures. From the perspective of variable selection, we show how relevant variables are recovered while noise variables are eliminated. We establish guarantees showing that both global minimizers and stationary points discard noise coordinates when the noise variables are Gaussian distributed. A central finding is that $\ell_1$-type kernels, such as the Laplace kernel, succeed in recovering features contributing to nonlinear effects at stationary points, whereas Gaussian kernels recover only linear ones.
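The objective stated in the TL;DR can be made concrete with a small empirical sketch. For a fixed weight vector $\beta$, the inner minimization over $f$ is ordinary kernel ridge regression on the reweighted inputs $\beta \circ X$, so the outer objective can be evaluated in closed form. The code below is an illustration only, not the authors' implementation: the function names, the unit-bandwidth Laplace kernel, the sample-average replacement of the population expectation, and the synthetic data are all assumptions made here.

```python
import numpy as np


def laplace_kernel(A, B):
    """Laplace (ell_1-type) kernel k(x, x') = exp(-||x - x'||_1).

    Unit bandwidth is an assumption made for this sketch.
    """
    l1_dists = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1)
    return np.exp(-l1_dists)


def krr_objective(beta, X, y, lam):
    """Empirical analogue of min_f (1/n) sum_i (y_i - f(beta * x_i))^2 + lam * ||f||_H^2.

    By the representer theorem, f = sum_j alpha_j k(beta * x_j, .), and the
    minimizing coefficients solve (K + n * lam * I) alpha = y.
    """
    n = X.shape[0]
    Z = X * beta                      # coordinate-wise reweighting beta ∘ x
    K = laplace_kernel(Z, Z)          # Gram matrix on the reweighted inputs
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    resid = y - K @ alpha
    return resid @ resid / n + lam * (alpha @ K @ alpha)


# Hypothetical synthetic data: only the first coordinate carries signal,
# the other two are Gaussian noise coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.normal(size=200)
lam = 0.01

J_signal = krr_objective(np.array([1.0, 0.0, 0.0]), X, y, lam)  # keep the signal coordinate
J_zero = krr_objective(np.zeros(3), X, y, lam)                  # discard every coordinate
print(J_signal, J_zero)
```

Minimizing `krr_objective` over `beta` (for example by gradient descent) would be the finite-sample counterpart of the outer problem. The paper's guarantees are stated at the population level, so this sketch only shows the shape of the objective rather than reproducing any result.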

Paper Structure

This paper contains 41 sections, 41 theorems, 239 equations.

Key Result

Lemma 1

For any $f \in H$, the function $f$ is continuous, satisfies $\|f\|_{L_\infty} \le \|f\|_H$, and vanishes at infinity: $\lim_{\|x\| \to \infty} f(x) = 0$.
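A sup-norm bound of this kind typically follows from the reproducing property together with the Cauchy-Schwarz inequality. The following is a sketch and not necessarily the authors' proof. It assumes the reproducing kernel $k$ of $H$ is normalized so that $k(x, x) = 1$ for all $x$; that normalization is not stated in this excerpt, though it holds for the standard Gaussian and Laplace kernels.

$$
|f(x)| = |\langle f, k(x, \cdot) \rangle_H| \le \|f\|_H \, \|k(x, \cdot)\|_H = \|f\|_H \sqrt{k(x, x)} = \|f\|_H .
$$

Taking the supremum over $x$ gives $\|f\|_{L_\infty} \le \|f\|_H$.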

Theorems & Definitions (96)

  • Lemma 1: Continuous embedding of $H$ in $C(\mathbb{R}^d)$
  • proof
  • Lemma 2: Denseness of $H$ in $L_2(\mu)$
  • proof
  • Example 1: Gaussian RKHS
  • Example 2: Laplace RKHS
  • Example 3: Radial Kernel RKHS
  • Example 4: $\ell_1$ type RKHS
  • Lemma 3
  • proof
  • ...and 86 more