The Galerkin method beats Graph-Based Approaches for Spectral Algorithms

Vivien Cabannes, Francis Bach

TL;DR

This work introduces a Galerkin framework for spectral decompositions of a broad class of operators, offering statistical and computational advantages over graph-based approaches. By restricting attention to a finite set of test functions and formulating a GSVD-based recovery of spectral components, it unifies kernel-based methods with random features and Nyström-type ideas, and scales favorably with data via $O(n p^2 c_H + p^3)$ complexity. The Laplacian example demonstrates both strong theoretical guarantees and practical efficiency, including implementations that exploit kernel structure to achieve $O(n p^2 + n p d)$ flops. Beyond linear operators, the paper discusses loss-based optimization to extend spectral learning to non-linear function spaces, connecting spectral methods to self-supervised learning and deep representations. Overall, the Galerkin approach advances scalable, principled spectral analysis with broad applicability to clustering, embeddings, and diffusion-inspired models, complemented by a public software library.

Abstract

Historically, the machine learning community has derived spectral decompositions from graph-based approaches. We break with this approach and prove the statistical and computational superiority of the Galerkin method, which consists in restricting the study to a small set of test functions. In particular, we introduce implementation tricks to deal with differential operators in large dimensions with structured kernels. Finally, we extend on the core principles beyond our approach to apply them to non-linear spaces of functions, such as the ones parameterized by deep neural networks, through loss-based optimization procedures.
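
To make the idea of "restricting the study to a small set of test functions" concrete, here is a minimal sketch for the Laplacian case, not the authors' implementation. It assumes Gaussian test functions $\varphi_j(x) = \exp(-\|x - c_j\|^2 / 2\sigma^2)$ centered at a subset of the data (a Nyström-type choice), and it solves a regularized generalized eigenproblem by Cholesky whitening rather than the GSVD mentioned in the TL;DR. The function name `galerkin_laplacian` and the parameters `sigma`, `reg`, and `k` are hypothetical. This naive version costs $O(n p^2 d + p^3)$ and does not use the kernel-structure tricks behind the $O(n p^2 + n p d)$ figure.

```python
import numpy as np

def galerkin_laplacian(X, centers, sigma=1.0, reg=1e-6, k=5):
    """Estimate the k smallest eigenpairs of E[<grad f, grad g>] restricted
    to span{phi_1, ..., phi_p}, with Gaussian test functions (an assumption)."""
    n, d = X.shape
    p = centers.shape[0]
    diff = X[:, None, :] - centers[None, :, :]                  # (n, p, d)
    K = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))   # phi_j(x_i), (n, p)
    G = -(diff / sigma ** 2) * K[:, :, None]                    # grad phi_j(x_i), (n, p, d)

    Phi = K.T @ K / n                          # empirical E[phi_i phi_j]
    L = np.einsum("nid,njd->ij", G, G) / n     # empirical E[<grad phi_i, grad phi_j>]

    # Solve L v = lam (Phi + reg I) v by whitening with a Cholesky factor C.
    C = np.linalg.cholesky(Phi + reg * np.eye(p))
    B = np.linalg.solve(C, L)                  # C^{-1} L
    M = np.linalg.solve(C, B.T).T              # C^{-1} L C^{-T}
    M = (M + M.T) / 2                          # remove round-off asymmetry
    lam, U = np.linalg.eigh(M)                 # eigenvalues in ascending order
    V = np.linalg.solve(C.T, U[:, :k])         # coefficients in the phi basis
    return lam[:k], V

# Toy usage: 300 Gaussian points in the plane, 15 test functions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
lam_demo, V_demo = galerkin_laplacian(X_demo, X_demo[:15], sigma=1.0, k=4)
```

The learned eigenfunctions are evaluated at new points by forming the same Gaussian features and multiplying by `V`. Both matrices are built from averages over the data, so adding samples only changes the $O(n)$ accumulation step, while the eigen-decomposition stays at size $p \times p$.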

Paper Structure (44 sections, 13 theorems, 123 equations, 10 figures, 1 table, 5 algorithms)

Key Result

Theorem 1

Assume that $H(\varphi_i, \psi_j, x)$ is bounded by $H_\infty$ independently of $(i, j, x)$, and that $L$ is invertible. For any $\delta > 0$ and $n > 3 \max(1, p^2 H_\infty^2 \|L^{-1}\|^{-2}) \log(2p/\delta)$, the following holds true with probability at least $1-\delta$ (the randomness coming from [...]) [...], where $\left\| \cdot \right\|$ is the operator norm.

Figures (10)

  • Figure 1: Level lines of the first sixteen learned eigenfunctions of ${\mathcal{L}}_0$ when the data generates two half-moons with $d=2$, with Algorithm \ref{alg:klap}, $n=10^5$ points, and $p=200$ Galerkin functions derived from the exponential kernel. See how those eigenfunctions are separated between the two clusters, and how, on each cluster, they identify with Fourier modes (i.e., cosines) when distorting the segment $[0,1]$ into a half-moon.
  • Figure 2: Learning spherical harmonics with polynomials of degree three (with $k_x(y) = (1 + x^\top y)^3$, which corresponds to feature maps that concatenate all the multivariate monomials of degree less than or equal to $s=3$). Because we consider $\rho$ uniform on the sphere, the operator ${\mathcal{L}}_0$ is diagonalized by spherical harmonics, which are polynomials of increasing degrees. The polynomial kernel of degree $D$ allows learning all harmonics of the $s$-th kind for $s$ less than or equal to $D$ (the ones of higher kind are polynomials of higher degree that cannot be reconstructed with polynomials of degree $D$, as illustrated with the fourth kind in the figure). Some of the learned eigenfunctions are represented on the top row, while some ground truths are represented on the bottom row. Our method learned perfectly valid harmonics, although, for repeated eigenvalues, it does not learn the canonical ones but an arbitrary basis of each eigenspace (which can be observed with the harmonics of the second kind in the figure).
  • Figure 3: (Left) Testing error \ref{eq:sur} when learning the first 25 "spherical harmonics" eigenvalues as a function of the number of samples $n$ in different dimensions $d$ with the Galerkin method. (Middle) Same figure with the graph-Laplacian. The error is averaged over 100 runs, with standard deviations shown in solid color, and we pick the best result over three kernels with five different parameters each and five different values for $p$ (best of 75 for Galerkin), as well as six different scales for weighting in the graph-Laplacian (best of 450 for graph-Laplacian). (Right) Computation time for the Galerkin method with a polynomial kernel of degree three and $p=177$. Experimental setups and reproducibility specifications are detailed in Appendix \ref{app:experiments}.
  • Figure 4: Comparison between plain regression and "Hermite regression" with the Gaussian kernel, $n=1000$ and $p=100$ when learning a constant function without noise (a task known to be hard for the Gaussian kernel).
  • Figure 5: Cherry-picked learned spherical harmonics with a neural network. The network is a multi-layer perceptron with 200, 200, 2000, 200 hidden neurons in the four hidden layers, and $m=16$ outputs optimized over 5000 batches of size 1000 with the contrastive version of the orthogonal regularizer. The optimizer is stochastic gradient descent with momentum ($m=1/2$) initialized with a learning rate $\gamma = 10^{-3}$, with a scheduler that decreases the learning rate by a factor $1/3$ after one third and two thirds of the training. Principal component analysis was used to disentangle the learned representation and retrieve the different learned eigenspaces and eigenfunctions.
  • ...and 5 more figures
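
The Figure 2 caption states that the kernel $k_x(y) = (1 + x^\top y)^3$ corresponds to a feature map made of all multivariate monomials of degree at most three. The sketch below checks this numerically using the multinomial theorem. The helper names `monomial_exponents` and `poly_features` are hypothetical, and the square-root multinomial weights are one standard choice of scaling, not necessarily the one used by the authors.

```python
import itertools
import math

def monomial_exponents(d, degree):
    """All exponent tuples alpha in N^d with alpha_1 + ... + alpha_d <= degree."""
    return [a for a in itertools.product(range(degree + 1), repeat=d)
            if sum(a) <= degree]

def poly_features(x, degree=3):
    """Explicit feature map phi with <phi(x), phi(y)> = (1 + <x, y>)**degree."""
    feats = []
    for alpha in monomial_exponents(len(x), degree):
        slack = degree - sum(alpha)            # power carried by the constant 1
        denom = math.prod(math.factorial(a) for a in alpha + (slack,))
        coef = math.factorial(degree) // denom  # multinomial coefficient
        mono = math.prod(xi ** ai for xi, ai in zip(x, alpha))
        feats.append(math.sqrt(coef) * mono)
    return feats

# Toy check in dimension d = 3 with arbitrary points.
x = [0.3, -1.2, 0.5]
y = [1.0, 0.4, -0.7]
kernel_value = (1 + sum(a * b for a, b in zip(x, y))) ** 3
feature_value = sum(a * b for a, b in zip(poly_features(x), poly_features(y)))
num_features = len(poly_features(x))
```

The number of features is $\binom{d+3}{3}$, so the explicit map is only practical in moderate dimension; evaluating the kernel directly costs $O(d)$ per pair regardless.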

Theorems & Definitions (24)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Proposition 1
  • proof
  • Lemma 2: Error decomposition
  • proof
  • Lemma 3
  • proof
  • Lemma 4: Estimation error
  • ...and 14 more