Transfer Learning Bayesian Optimization to Design Competitor DNA Molecules for Use in Diagnostic Assays

Ruby Sedgwick, John P. Goertz, Molly M. Stevens, Ruth Misener, Mark van der Wilk

TL;DR

This paper uses cross-validation to compare the predictive accuracy of different transfer learning models, and then compares the performance of the models on both single-objective and penalized optimization tasks.

Abstract

With the rise in engineered biomolecular devices, there is an increased need for tailor-made biological sequences. Often, many similar biological sequences need to be made for a specific application, meaning numerous, sometimes prohibitively expensive, lab experiments are necessary for their optimization. This paper presents a transfer learning design-of-experiments workflow to make this development feasible. By combining a transfer learning surrogate model with Bayesian optimization, we show how the total number of experiments can be reduced by sharing information between optimization tasks. We demonstrate the reduction in the number of experiments using data from the development of DNA competitors for use in an amplification-based diagnostic assay. We use cross-validation to compare the predictive accuracy of different transfer learning models, and then compare the performance of the models on both single-objective and penalized optimization tasks.

Paper Structure (43 sections, 20 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: Design of experiments workflow for optimizing the competitor DNA molecules. (A) Data is collected in the lab using a DNA amplification reaction assay. (B) The rate and drift are then calculated by fitting amplification curves. (C) A transfer learning surrogate model uses the data to predict the rate and drift for each of the given competitors. The LVMOGP is introduced in Section \ref{sec:lvmogp}. Information is shared through the latent space, with one point on the latent space for each competitor. The shaded regions indicate the uncertainty. The 3D plots are predictions of the model for given competitors. (D) The Bayesian optimization algorithm, introduced in Section \ref{sec:bayes_opt}, combines information about the rate and drift surfaces in an acquisition function to select the experiment to run for each competitor. The solid lines in the rate and drift plots represent the mean of the Gaussian process models, while the shaded regions are $2\times\text{standard deviation}$. This process is repeated until all optimal competitor sequences are found or the experimental budget is exhausted.
  • Figure 2: Predictions of the four Gaussian process models fitted to a toy dataset with linear correlation between output surfaces. The dots are the data, the dashed line is the true function, the solid line is the Gaussian process mean prediction and the shaded region is two times the predicted standard deviation, meaning around $95\%$ of the data points should lie within the shaded region. The bottom row explains how data is transferred between the surfaces by each model. For the average Gaussian process (AvgGP), all data is assumed to be from the same surface. For the multioutput Gaussian process (MOGP), information is transferred only about the hyperparameter values, not the function values. In the linear model of coregionalisation (LMC), information is transferred via the similarity matrix $\mathbf{B}$, and in the latent variable multioutput Gaussian process (LVMOGP) it is transferred through the latent space. Theoretically, LMC and LVMOGP can learn whether information can be transferred and, if so, how much.
  • Figure 3: Schematic of the competitor design space. For a given competitor DNA molecule, the primers and fluorescent probe regions are fixed. We can edit the design region to ensure the sequence has a given number of base pairs and guanine-cytosine content. Changing the number of base pairs and guanine-cytosine content affects the rate and drift of the competitor, allowing us to fine-tune to the rate and drift required for the diagnostic assay.
  • Figure 4: Results of experiments with synthetically-generated data. The plots on the left show example data-generating functions used for the synthetic experiments. The plots on the right show the RMSE and NLPD for the three different test response surface types for each of the Gaussian process models. New points are added randomly, and each line is the mean of 5 different randomly generated data sets, all generated from the same test functions.
  • Figure 5: Results of cross-validation on the DNA amplification data for both rate and drift. For each cross-validation run, the training set consisted of all the data from two competitors and a random subset of the data on the remaining competitors, ensuring all competitors had at least one data point. This is repeated for different percentages of data in the training set, and for each percentage, it is repeated 70 times.
  • ...and 8 more figures
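
The cross-validation split described in the Figure 5 caption can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `make_cv_split`, its parameters, and the reading of "percentage of data in the training set" as a fraction of all data points are assumptions made for the example.

```python
import numpy as np


def make_cv_split(competitor_ids, train_fraction, n_full=2, rng=None):
    """Return a boolean mask marking training points (hypothetical helper).

    Assumed reading of the Figure 5 caption:
      * all data from `n_full` randomly chosen competitors is in the training set;
      * every other competitor contributes at least one training point;
      * further points are drawn at random until roughly `train_fraction`
        of all points are in the training set.
    """
    rng = np.random.default_rng(rng)
    competitor_ids = np.asarray(competitor_ids)
    unique = np.unique(competitor_ids)

    # Competitors whose data is used in full.
    full = rng.choice(unique, size=n_full, replace=False)
    train = np.isin(competitor_ids, full)

    # Guarantee one training point for each remaining competitor.
    for c in unique:
        if c in full:
            continue
        idx = np.flatnonzero(competitor_ids == c)
        train[rng.choice(idx)] = True

    # Top up with random points until the target size is reached.
    n_target = int(round(train_fraction * competitor_ids.size))
    n_extra = n_target - int(train.sum())
    pool = np.flatnonzero(~train)
    if n_extra > 0 and pool.size > 0:
        extra = rng.choice(pool, size=min(n_extra, pool.size), replace=False)
        train[extra] = True
    return train


# Example: 5 competitors with 10 measurements each, half of the data for training.
ids = np.repeat(np.arange(5), 10)
mask = make_cv_split(ids, train_fraction=0.5, rng=0)
```

Repeating the call with different seeds gives the repeated random splits mentioned in the caption. If the mandatory points already exceed the requested fraction, the sketch returns the mandatory set unchanged.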