From Deep Additive Kernel Learning to Last-Layer Bayesian Neural Networks via Induced Prior Approximation

Wenyuan Zhao, Haoyuan Chen, Tie Liu, Rui Tuo, Chao Tian

TL;DR

This paper tackles the scalability and uncertainty-estimation challenges of Deep Kernel Learning (DKL) by introducing the Deep Additive Kernel (DAK) model, which reinterprets the last-layer GP as a sparse Bayesian neural network via induced prior approximation on 1-D grids. By embedding hierarchical NN features into an additive GP framework and applying a sparse, grid-based prior, DAK achieves linear-time inference with a closed-form ELBO and predictive distribution for regression, while retaining DKL-style interpretability. Empirically, DAK outperforms state-of-the-art DKL methods on regression and image classification tasks and scales better to high-dimensional feature spaces. The work bridges GPs and NNs through a last-layer Bayesian lens and opens avenues for more general kernels and variational families in scalable kernel learning.

Abstract

With the strengths of both deep learning and kernel methods like Gaussian Processes (GPs), Deep Kernel Learning (DKL) has gained considerable attention in recent years. From the computational perspective, however, DKL becomes challenging when the input dimension of the GP layer is high. To address this challenge, we propose the Deep Additive Kernel (DAK) model, which incorporates i) an additive structure for the last-layer GP; and ii) induced prior approximation for each GP unit. This naturally leads to a last-layer Bayesian neural network (BNN) architecture. The proposed method enjoys the interpretability of DKL as well as the computational advantages of BNN. Empirical results show that the proposed approach outperforms state-of-the-art DKL methods in both regression and classification tasks.
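
The additive structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the squared-exponential base kernel, the `tanh` feature map standing in for the NN, the array sizes, and the names `rbf_1d` and `additive_kernel` are all assumptions chosen for illustration. The point is only that the last-layer kernel is a weighted sum of one-dimensional kernels, one per embedded feature.

```python
import numpy as np

def rbf_1d(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two 1-D arrays of scalars.

    The choice of base kernel is an assumption made for this sketch.
    """
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def additive_kernel(Z1, Z2, weights):
    """Weighted sum of 1-D base kernels, one per feature dimension.

    Z1: (n, P) features, Z2: (m, P) features, weights: (P,) non-negative.
    Returns an (n, m) kernel matrix.
    """
    K = np.zeros((Z1.shape[0], Z2.shape[0]))
    for p in range(Z1.shape[1]):
        K += weights[p] * rbf_1d(Z1[:, p], Z2[:, p])
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # raw inputs (hypothetical sizes)
W = rng.normal(size=(8, 3))   # stand-in for the linear embedding, P = 3
Z = np.tanh(X) @ W            # stand-in for the NN feature extractor
K = additive_kernel(Z, Z, np.ones(3))
```

Because each term depends on a single feature, every base GP can be handled as a one-dimensional problem, which is what makes a per-unit approximation on a 1-D grid possible.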

Paper Structure

This paper contains 56 sections, 41 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Model architecture of Deep Additive Kernel (DAK). DAK consists of a feature extractor $\text{NN}(\cdot)$ with a linear embedding layer $\mathbf{W}$, an additive kernel with base GPs $\tilde{g}_{p}(\cdot)$ for $p=1,\ldots,P$, and a weighted sum layer. The embedded features learned by the DNN are decomposed into first-order components and fed to the base GPs, each consisting of a kernel activation and a GP forward layer. Each kernel activation is built from a one-dimensional dyadic point set on an induced grid, with sparse non-zero activated neurons.
  • Figure 2: Results on the toy dataset. (a)--(d) show the predictive posterior of the exact GP, DGP, exact DKL, and the proposed DAK model, respectively, on noisy data generated by a zero-mean 1D GP with covariance function $k(x,x')=\exp( -(x-x')^2 )$. We set the number of MC samples to $S=4$ for estimating the expected log-likelihood in the ELBO during training. The predictive mean and $\pm$2 standard deviations are plotted together with the observed data. (e) shows the NN fit on the same training data.
  • Figure 3: Results on 1D regression with different last-layer learning rates. The learning rate of the NN feature extractor is set to $0.01$. (a)--(f) show the regression fits and the corresponding training losses. DAK fits well when its last layer uses the same learning rate as the NN feature extractor (lr=0.01), whereas DKL requires separate tuning of the learning rate for the last-layer GPs. Additionally, a lower training loss does not necessarily prevent overfitting in DKL.
  • Figure 4: Test error, test NLL, and ELBO curves of NN, SVDKL, and DAK with batch sizes of 128 and 1024 on CIFAR-10, averaged over 3 runs. DAK outperforms SVDKL on both test error and NLL throughout training. Additionally, SVDKL degrades more and struggles to fit as the batch size grows.
  • Figure 5: Test error, test NLL, and ELBO curves of NN, SVDKL, and DAK with batch sizes of 128 and 1024 on CIFAR-100, averaged over 3 runs. DAK trains the NN and the last-layer additive GPs jointly, while SVDKL uses a pre-trained NN and fine-tunes the last-layer GP, since SVDKL struggles to fit under full training. DAK outperforms SVDKL on both test error and NLL throughout training. SVDKL struggles to fit in high-dimensional multitask cases, indicating that pre-training is necessary for SVDKL, whereas DAK fits well with high dimensionality and large batch sizes.
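
The toy data in Figure 2 is described as noisy samples from a zero-mean 1D GP with covariance $k(x,x')=\exp( -(x-x')^2 )$. A minimal sketch of how such data could be generated is given below. Only the kernel comes from the caption; the sample size, input range, noise level, jitter, and the names `toy_kernel` and `sample_noisy_gp` are assumptions made for illustration and may differ from the paper's setup.

```python
import numpy as np

def toy_kernel(x1, x2):
    """Covariance k(x, x') = exp(-(x - x')^2) from the Figure 2 caption."""
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2)

def sample_noisy_gp(n=50, noise_std=0.1, seed=0):
    """Draw one noisy sample path from a zero-mean 1-D GP.

    n, noise_std, the input range, and the jitter are hypothetical
    values chosen for this sketch.
    """
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(-3.0, 3.0, size=n))
    K = toy_kernel(x, x)
    # Small diagonal jitter keeps the Cholesky factorization stable.
    L = np.linalg.cholesky(K + 1e-6 * np.eye(n))
    f = L @ rng.standard_normal(n)               # latent function values
    y = f + noise_std * rng.standard_normal(n)   # noisy observations
    return x, y

x, y = sample_noisy_gp()
```

Multiplying a standard normal vector by the Cholesky factor of the kernel matrix yields a draw with that covariance, which is the standard way to sample a GP at a finite set of inputs.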