Scaling Gaussian Processes for Learning Curve Prediction via Latent Kronecker Structure

Jihao Andreas Lin; Sebastian Ament; Maximilian Balandat; Eytan Bakshy

TL;DR

We interpret the joint covariance matrix of observed learning curve values as the projection of a latent Kronecker product. The resulting GP model scales to learning curve data with missing values and can match the performance of a Transformer on a learning curve prediction task.

Abstract

A key task in AutoML is to model learning curves of machine learning models jointly as a function of model hyper-parameters and training progression. While Gaussian processes (GPs) are suitable for this task, naïve GPs require $\mathcal{O}(n^3m^3)$ time and $\mathcal{O}(n^2 m^2)$ space for $n$ hyper-parameter configurations and $\mathcal{O}(m)$ learning curve observations per hyper-parameter configuration. Efficient inference via Kronecker structure is typically incompatible with early-stopping, because early-stopped runs leave learning curve values missing. We impose $\textit{latent Kronecker structure}$ to leverage efficient product kernels while handling missing values. In particular, we interpret the joint covariance matrix of observed values as the projection of a latent Kronecker product. Combined with iterative linear solvers and structured matrix-vector multiplication, our method only requires $\mathcal{O}(n^3 + m^3)$ time and $\mathcal{O}(n^2 + m^2)$ space. We show that our GP model can match the performance of a Transformer on a learning curve prediction task.
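
The projection idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a 1-D hyper-parameter, RBF kernels for both factors, a boolean early-stopping mask, and a plain conjugate gradient solver; the helper names (`rbf`, `latent_kron_mvm`, `conjugate_gradients`) are hypothetical. It shows that a matrix-vector product with the observed covariance $P (K_X \otimes K_T) P^\top$ can be computed by zero-padding to the full grid, applying the two small factors, and selecting the observed entries, so the $nm \times nm$ matrix is never formed.

```python
import numpy as np


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel on 1-D inputs (an illustrative choice)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def latent_kron_mvm(K_x, K_t, mask, v, noise=0.0):
    """Compute (P (K_x kron K_t) P^T + noise * I) v without forming the Kronecker product."""
    V = np.zeros(mask.shape)
    V[mask] = v              # P^T v: zero-pad observed values onto the full n x m grid
    out = K_x @ V @ K_t.T    # (K_x kron K_t) vec(V) under row-major vectorization
    return out[mask] + noise * v  # P (...): keep only the observed entries


def conjugate_gradients(mvm, b, tol=1e-10, max_iter=1000):
    """Textbook conjugate gradients for a symmetric positive-definite operator."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = mvm(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


rng = np.random.default_rng(0)
n, m = 5, 8                      # n configurations, m training steps
x = rng.uniform(size=n)          # assumed 1-D hyper-parameter values
t = np.linspace(0.0, 1.0, m)     # normalized training progression
K_x = rbf(x, x, lengthscale=0.5)
K_t = rbf(t, t, lengthscale=0.3)

# Early stopping: curve i is only observed for its first stop[i] steps.
stop = rng.integers(2, m + 1, size=n)
mask = np.arange(m)[None, :] < stop[:, None]

y = rng.normal(size=mask.sum())  # synthetic observations, for illustration only
noise = 0.1

# Dense reference: explicitly project the full Kronecker product.
obs = mask.ravel()
K_dense = np.kron(K_x, K_t)[np.ix_(obs, obs)] + noise * np.eye(obs.sum())
alpha_dense = np.linalg.solve(K_dense, y)

# Structured solve: only n x n and m x m factors are ever stored.
alpha_cg = conjugate_gradients(
    lambda v: latent_kron_mvm(K_x, K_t, mask, v, noise), y
)
```

Each structured product costs $\mathcal{O}(n^2 m + n m^2)$ time here, compared with $\mathcal{O}(n^2 m^2)$ for a dense product, and the solver only touches the covariance through that product, which is why missing values do not break the Kronecker factorization.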
