Streaming Gaussian Dirichlet Random Fields for Spatial Predictions of High Dimensional Categorical Observations

J. E. San Soucie; H. M. Sosik; Y. Girdhar

TL;DR

This work addresses the challenge of predicting and planning with streaming, spatiotemporal, sparse, high-dimensional categorical observations in autonomous sensing. It introduces the Streaming Gaussian Dirichlet Random Field (S-GDRF), a streaming extension of the Gaussian Dirichlet Random Field (GDRF) that combines Gaussian process-driven spatial priors with Dirichlet latent communities over observation categories, enabling interpolation and planning. A novel subsampling-based streaming black-box variational inference (BBVI) scheme with a sparse inducing-point approximation achieves bounded-time, linear-space inference, with complexity $O(n_s m^3)$. Empirical results on plankton imagery and reef imagery show that S-GDRF outperforms a Variational Gaussian Process (VGP) baseline that fits a single GP per category, scales to thousands of categories, and delivers real-time inference suitable for onboard deployment, thereby enabling informative path planning over high-dimensional categorical observations.

Abstract

We present the Streaming Gaussian Dirichlet Random Field (S-GDRF) model, a novel approach for modeling a stream of spatiotemporally distributed, sparse, high-dimensional categorical observations. The proposed approach efficiently learns global and local patterns in spatiotemporal data, allowing for fast inference and querying with a bounded time complexity. Using a high-resolution data series of plankton images classified with a neural network, we demonstrate the ability of the approach to make more accurate predictions compared to a Variational Gaussian Process (VGP), and to learn a predictive distribution of observations from streaming categorical data. S-GDRFs open the door to enabling efficient informative path planning over high-dimensional categorical observations, which until now has not been feasible.

Paper Structure (19 sections, 1 equation, 5 figures)

Introduction and Background
Introduction
Background
Gaussian processes.
Methods
Technical Approach
Gaussian-Dirichlet Random Fields.
Streaming inference for S-GDRFs.
Evaluation of GDRF predictive power
Experiments
Temporal data.
Spatial data.
Results
1-dimensional S-GDRF prediction.
2-dimensional S-GDRF inference.
...and 4 more sections

Figures (5)

Figure 1: The graphical model for GDRFs (reproduced from sansoucie2020gaussian)
Figure 2: In July 2021 aboard the R/V Endeavor, an Imaging FlowCytobot was used to take high-throughput, high-resolution images of plankton in surface seawater. Post-cruise, all images were classified with a neural network-based classifier. The S-GDRF model enables prediction of this kind of spatiotemporally distributed, sparse, high-dimensional categorical data.
Figure 3: S-GDRF inference on the Joel's Shoal coral head shows reasonable labelings for pixels in a 130x130 grid with 15436 unique extracted features.
Figure 4: Time-series plots of the distribution of plankton, with model fits (left of the dashed line) and predictions (right of the dashed line) in out-of-coverage areas (more than one kernel lengthscale from any datapoint left of the black line). Due to the large number of plankton taxa, only the top ten taxa (by mean observed relative abundance) are marked in the legend.
Figure 5: S-GDRF and VGP predictive Kullback-Leibler divergence (PKL) metric for the 1-D temporal dataset. The x-axis is the number of training data, while the y-axis represents the PKL metric, i.e. the KL divergence between model predictions on all other data and observations. The median and quartile PKL for all unobserved (i.e. non-training) data are displayed with solid lines. The dashed lines represent the coverage fraction, i.e. the proportion of the unobserved data points that lie within one GP kernel lengthscale of a previously observed datapoint.
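
The two quantities plotted in Figure 5 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `eps` smoothing, the Euclidean distance, and the KL direction (observations relative to predictions) are all assumptions made for illustration, since the caption does not fix them.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """Row-wise KL(p || q) between categorical distributions.

    Assumption: p holds observed relative abundances and q holds model
    predictions; eps smoothing avoids log(0) and is not from the paper.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum(axis=-1, keepdims=True)
    q = q / q.sum(axis=-1, keepdims=True)
    return np.sum(p * np.log(p / q), axis=-1)


def coverage_fraction(x_train, x_test, lengthscale):
    """Fraction of held-out locations within one lengthscale of any training location."""
    x_train = np.asarray(x_train, dtype=float).reshape(len(x_train), -1)
    x_test = np.asarray(x_test, dtype=float).reshape(len(x_test), -1)
    # Pairwise Euclidean distances, shape (n_test, n_train).
    dists = np.linalg.norm(x_test[:, None, :] - x_train[None, :, :], axis=-1)
    return float(np.mean(dists.min(axis=1) <= lengthscale))


def pkl_summary(observed, predicted):
    """Lower quartile, median, and upper quartile of per-point KL values."""
    return np.percentile(kl_divergence(observed, predicted), [25, 50, 75])
```

Under these assumptions, `pkl_summary` would be evaluated on the held-out points at each training-set size to trace the solid curves, and `coverage_fraction` would give the dashed curve.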
