Information Geometry and Beta Link for Optimizing Sparse Variational Student-t Processes

Jian Xu; Delu Zeng; John Paisley

TL;DR

This work adopts natural gradient methods from information geometry to optimize the variational parameters of Student-t Processes, using tools such as the Fisher information matrix, which is linked to the Beta function in the model.

Abstract

Recently, a sparse version of Student-t Processes, termed sparse variational Student-t Processes, has been proposed to enhance computational efficiency and flexibility for real-world datasets using stochastic gradient descent. However, traditional gradient descent methods like Adam may not fully exploit the parameter space geometry, potentially leading to slower convergence and suboptimal performance. To mitigate these issues, we adopt natural gradient methods from information geometry for variational parameter optimization of Student-t Processes. This approach leverages the curvature and structure of the parameter space, utilizing tools such as the Fisher information matrix, which is linked to the Beta function in our model. This method provides robust mathematical support for the natural gradient algorithm when using Student's t-distribution as the variational distribution. Additionally, we present a mini-batch algorithm for efficiently computing natural gradients. Experimental results across four benchmark datasets demonstrate that our method consistently accelerates convergence.

Paper Structure (20 sections, 2 theorems, 48 equations, 2 figures, 1 algorithm)

Introduction
Background and Notations
Student-t Processes
Sparse variational Student-t Processes
Inducing Points Setup
Training Data and Noise Model
Joint Distribution
Conditional Distribution
Optimization
The Steepest Descent
Natural Gradient Learning in SVTP
Fisher Information Matrix as Riemannian Metric Tensor
Representation of Fisher Information Matrix in SVTP
Fisher Information Matrix Linked to the Beta Function
Stochastic Natural Gradient Descent
...and 5 more sections

Key Result

Lemma 1

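(Amari, 1998) The steepest descent direction of $\mathcal{L}(\mathbf{\theta})$ in a Riemannian space is given by

$$-\tilde{\nabla} \mathcal{L}(\mathbf{\theta}) = -G^{-1}(\mathbf{\theta}) \nabla \mathcal{L}(\mathbf{\theta}),$$

where $G^{-1} = (g^{ij})$ is the inverse of the metric $G = (g_{ij}(\mathbf{\theta}))$ and $\nabla \mathcal{L}$ is the conventional gradient, $\nabla \mathcal{L}(\mathbf{\theta}) = \left( \frac{\partial \mathcal{L}}{\partial \theta_1}, \ldots, \frac{\partial \mathcal{L}}{\partial \theta_n} \right)^{\top}$.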
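To make Lemma 1 concrete, the following is a minimal sketch of one natural gradient step, which moves along $-G^{-1} \nabla \mathcal{L}$ rather than $-\nabla \mathcal{L}$. It is not the paper's SVTP algorithm: the helper names (`natural_gradient_direction`, `gaussian_fisher`, `natural_gradient_step`) are hypothetical, and the metric used is the well-known Fisher information of a univariate Gaussian in $(\mu, \sigma)$, chosen only as a simple stand-in for the Student-t Fisher matrix derived in the paper.

```python
import numpy as np


def natural_gradient_direction(grad, metric):
    """Return -G^{-1} grad by solving G x = grad (avoids forming the inverse)."""
    return -np.linalg.solve(metric, grad)


def gaussian_fisher(theta):
    """Fisher information of N(mu, sigma^2) with respect to (mu, sigma).

    Assumption: a stand-in metric for illustration only; the paper uses the
    Fisher matrix of a Student-t variational distribution instead.
    """
    _, sigma = theta
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])


def natural_gradient_step(theta, grad, metric, lr=0.1):
    """One descent step along the natural gradient direction."""
    return theta + lr * natural_gradient_direction(grad, metric)


# Hypothetical parameter values and loss gradient, chosen for illustration.
theta = np.array([0.0, 2.0])   # (mu, sigma)
grad = np.array([1.0, -0.5])   # conventional gradient of the loss at theta

G = gaussian_fisher(theta)
direction = natural_gradient_direction(grad, G)
theta_new = natural_gradient_step(theta, grad, G, lr=0.1)

print("natural gradient direction:", direction)
print("updated parameters:", theta_new)
```

When the metric is the identity, the direction reduces to the ordinary negative gradient, so the Euclidean case is recovered as a special case. Solving the linear system instead of inverting $G$ is a common numerical choice and is an assumption of this sketch, not something stated in the source.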

Figures (2)

Figure 1: Negative ELBO Curves for the Four Datasets
Figure 2: Test MSE Curves for the Four Datasets

Theorems & Definitions (4)

Definition 1
Definition 2
Lemma 1
Lemma 2
