Mixtures of Neural Network Experts with Application to Phytoplankton Flow Cytometry Data

Ethan Pawl, François Ribalet, Paul A. Parker, Sangwon Hyun

Abstract

Flow cytometry is a valuable technique that measures the optical properties of particles at single-cell resolution. When deployed in the ocean, flow cytometry allows oceanographers to study different types of photosynthetic microbes called phytoplankton. It is of great interest to study how phytoplankton properties change in response to environmental conditions. In our work, we develop a nonlinear mixture of experts model that uses random-weight neural networks to estimate a separate regression function for each subpopulation. Our model flexibly estimates how cell properties and relative abundances depend on environmental covariates in each segment of a heterogeneous sample, without the computational burden of backpropagation. In simulated examples, the proposed model provides superior predictive performance compared to a mixture of linear experts. Applied to real data, our model has (1) comparable out-of-sample prediction performance, and (2) more realistic estimates of phytoplankton behavior.
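To make the idea of a random-weight neural network expert concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single hidden layer with Gaussian random weights and a tanh activation, a one-dimensional response, and hypothetical names (`random_hidden_layer`, `hidden_features`, `expert_outputs`); none of these details are given in the abstract. The hidden weights are drawn once and never trained, so only the output coefficients would need to be estimated, which is why no backpropagation is required.

```python
import math
import random


def random_hidden_layer(n_inputs, n_hidden, seed=0):
    """Draw hidden-layer weights and biases once; they are never trained."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_hidden)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    return weights, biases


def hidden_features(x, weights, biases):
    """Map covariates x to fixed nonlinear features h(x) = tanh(Wx + b)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expert_outputs(x, weights, biases, mean_coefs, prob_coefs):
    """Cluster means (linear in h) and cluster probabilities (softmax of linear in h)."""
    h = [1.0] + hidden_features(x, weights, biases)  # prepend an intercept
    means = [sum(c * hj for c, hj in zip(coef, h)) for coef in mean_coefs]
    scores = [sum(c * hj for c, hj in zip(coef, h)) for coef in prob_coefs]
    return means, softmax(scores)


# Hypothetical setup: 3 environmental covariates, 5 hidden units, 2 clusters.
W, b = random_hidden_layer(n_inputs=3, n_hidden=5, seed=1)
coef_rng = random.Random(2)
# Placeholder output coefficients; in practice these would be estimated from data.
mean_coefs = [[coef_rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(2)]
prob_coefs = [[coef_rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(2)]
means, probs = expert_outputs([0.2, -1.0, 0.5], W, b, mean_coefs, prob_coefs)
```

Because the cluster means and probabilities are linear (or softmax-linear) in the fixed features, a fitting procedure designed for a mixture of linear experts could in principle be reused with `h(x)` in place of the raw covariates; whether the paper does exactly this is an assumption of this sketch.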

Paper Structure

This paper contains 9 sections, 14 equations, 20 figures.

Figures (20)

  • Figure 1: Principal components of the environmental covariates. The top panel shows the sample correlations between the first four principal components, latitude, and the covariates. The bottom-left panel shows the first principal component (PC1), together with the most scientifically relevant covariates that are strongly correlated with PC1, plotted over time. We also overlay latitude as a dashed black line to show its similarity to PC1. The two vertical dotted lines at latitudes 31.9$^\circ$ N and 34$^\circ$ N mark the estimated boundaries of the North Pacific Transition Zone, a region where latitude-varying conditions change rapidly; the boundaries were hand-picked as the region where PC1 changes rapidly. In the bottom-right panel, we display a similar plot for PC3 and PC4. In Supplement A, we display a similar plot for PC2 (Supplementary Figure S1) and the full set of principal components (Supplementary Figure S2).
  • Figure 2: Estimated model on simulated 1-dimensional data. Both panels display in the background the simulated and binned data $y_b^{(t)}$ over time $t$, generated from the "Interaction Mean" model, with signal size $\Delta = 0.40384$. The greyscale color of the bin $b$ at time $t$ is proportional to the biomass $c_b^{(t)}$. Overlaid are the estimated parameters $\hat{\boldsymbol{\mu}}_{k, t}$ over time $t$ from the linear (left) and nonlinear (right) models as solid colored lines. These lines' thickness at time $t$ is proportional to the estimated cluster probability $\hat{\pi}_{k,t}$. The thin colored lines are the bounds of symmetric 95% pointwise conditional probability regions of each Gaussian. The black dashed lines represent the true cluster means, and the black dotted lines represent the true 95% central probability regions.
  • Figure 3: Out-of-sample negative log-pseudolikelihoods (NLPLs) after subtracting the NLPL of the oracle of each model at each signal size. The linear model is represented by the dashed line and the nonlinear by the solid line.
  • Figure 4: Cytogram in two dimensions, diameter and chlorophyll, at time $t = 33$ (2017-07-02 08:00-09:00), with model estimates overlaid in red (marking the important clusters) or grey (the rest). The data in the background are binned, with the intensity of blue hue proportional to the total biomass in each bin. The solid points mark the cluster means, and the sizes of the points are proportional to the estimated cluster probabilities. The ellipses represent 2-dimensional projections of symmetric cluster-conditional 95% probability sets of each cluster.
  • Figure 5: Mean and probability estimates over time for the Prochlorococcus cluster. In the left panel, the three dimensions of the cluster mean are plotted together; diameter, chlorophyll, and phycoerythrin estimates are represented by the red, blue, and yellow lines, respectively. In the right panel, relative abundance predictions are displayed. The nonlinear model predictions are represented by the solid lines and the linear model predictions by dot-dashed lines. The vertical dotted lines represent the boundaries of the North Pacific Transition Zone. Supplementary Figures S3–S5 in Supplement B present similar plots for Synechococcus and PicoEukaryote clusters 1 and 2, respectively.
  • ...and 15 more figures