Learning Genetic Circuit Modules with Neural Networks: Full Version

Jichi Wang, Eduardo D. Sontag, Domitilla Del Vecchio

Abstract

In several applications, including in synthetic biology, one often has input/output data on a system composed of many modules, and although the modules' input/output functions and signals may be unknown, knowledge of the composition architecture can significantly reduce the amount of training data required to learn the system's input/output mapping. Learning the modules' input/output functions is also necessary for designing new systems from different composition architectures. Here, we propose a modular learning framework, which incorporates prior knowledge of the system's compositional structure to (a) identify the composing modules' input/output functions from the system's input/output data and (b) achieve this by using a reduced amount of data compared to what would be required without knowledge of the compositional structure. To achieve this, we introduce the notion of modular identifiability, which allows recovery of modules' input/output functions from a subset of the system's input/output data, and provide theoretical guarantees on a class of systems motivated by genetic circuits. We demonstrate the theory on computational studies showing that a neural network (NNET) that accounts for the compositional structure can learn the composing modules' input/output functions and predict the system's output on inputs outside of the training set distribution. By contrast, a neural network that is agnostic of the structure is unable to predict on inputs that fall outside of the training set distribution. By reducing the need for experimental data and allowing module identification, this framework offers the potential to ease the design of synthetic biological circuits and of multi-module systems more generally.

Paper Structure

This paper contains 12 sections, 4 theorems, 54 equations, 5 figures.

Key Result

Proposition 1

The model is modularly identifiable, i.e., for all $u \in [0, 1]$,

Figures (5)

  • Figure 2: Illustration of the system $\Sigma$. The system consists of $n$ modules, each defined by $y_i = f_i(u_i)$ for $i \in \{1, \dots, n\}$, whose outputs are fed into a composition map $G$.
  • Figure 3: Illustration showing $n$ transcriptional regulation modules that share ribosomes.
  • Figure 4: Define $E_G := \max_k |G(\hat{f}(u_k)) - G(f(u_k))| / \max_k G(f(u_k))$ and $E_f := \max_k |\hat{f}(u_k) - f(u_k)| / \max_k f(u_k)$, in which $u_k$ are the elements of the training set. (a) The plot shows the error convergence over 1000 epochs. The training dataset comprises 100 input/output pairs $(u_k, Y_k)$, where the inputs $u_k$ are uniformly sampled from the interval $[0, 1]$. We choose $\hat{f}(u)$ as a fully connected feedforward NNET with four hidden layers, each containing 20 ReLU-activated neurons, and initialize its weights using Kaiming initialization (He et al., 2015). The NNET $\hat{f}(u)$ is trained using the Adam optimizer with a learning rate of 0.1 and full-batch gradient descent, with the cost function given by the mean square error $\frac{1}{100}\sum_{k = 1}^{100} |G(\hat{f}(u_k)) - G(f(u_k))|^2$. (b) The plot shows the comparison between the trained $\hat{f}(u)$ and the true $f(u)$ at the final (1000th) epoch.
  • Figure 5: Error convergence over 84,000 epochs for the case of two modules. Define the following errors: $E_{Gi} := \max_{k} |G_i(\hat{f}_1(u_{1_k}), \hat{f}_2(u_{2_k}), \hat{\theta}) - G_i(f_1(u_{1_k}), f_2(u_{2_k}), \theta)| / \max_{k}G_i(f_1(u_{1_k}), f_2(u_{2_k}), \theta)$, $E_{fi} := \max_{k} |\hat{f}_i(u_{i_k}) - f_i(u_{i_k})| / \max_{k}f_i(u_{i_k})$, and $E_{\theta i} := |\hat{\theta}_i - \theta_i| / \theta_i$, for $i \in \{1, 2\}$. (a) The plot shows the convergence of $E_{G1}$ and $E_{G2}$. (b) The plot shows the convergence of $E_{f1}$ and $E_{f2}$. (c) The plot shows the convergence of $E_{\theta 1}$ and $E_{\theta 2}$. (d) The plot shows the comparison between the true functions $f_1(u_1)$, $f_2(u_2)$ and their learned counterparts $\hat{f}_1(u_1)$, $\hat{f}_2(u_2)$ at the final (84,000th) epoch. At the final epoch, the learned parameters are $\hat{\theta}_1 = 0.70325$ and $\hat{\theta}_2 = 0.20361$, compared with the true parameters $\theta_1 = 0.70340$ and $\theta_2 = 0.20357$.
  • Figure 6: We use the same simulation setup as in Fig. 5 to compare the generalization ability of our modular learning approach and a monolithic learning approach. The modular learning model is the same as in Fig. 5. Let $\textit{Modular-model}$ denote the trained modular learning model and $\textit{Mono-model}$ be the trained monolithic learning model. For $M \in \{\textit{Modular-model}, \textit{Mono-model}\}$ and $i \in \{1, 2\}$, the point-wise error is defined as $E_{G_i}^M(u_1, u_2) := |M(u_1, u_2)_i - G_i(f_1(u_1), f_2(u_2), \theta)| / G_i(f_1(u_1), f_2(u_2), \theta)$. The loss surfaces are generated from 10,000 test grid points uniformly sampled over $[0,1]^2$.
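
The error measures defined in the captions above can be sketched in a few lines of plain Python. This is only an illustration of the definitions of $E_f$, $E_G$, the mean square cost, and the point-wise error, not the authors' code. The functions `f_true`, `G`, and `f_hat` below are hypothetical stand-ins chosen for the example (a Hill-type module, a saturating composition map, and an estimate that is 2% too large); none of them is taken from the paper.

```python
import random

def max_relative_error(pred, true):
    """max_k |pred_k - true_k| / max_k true_k, the form shared by E_f and E_G."""
    return max(abs(p - t) for p, t in zip(pred, true)) / max(true)

def mean_square_error(pred, true):
    """(1/N) * sum_k |pred_k - true_k|^2, the form of the training cost."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def pointwise_relative_error(pred, true):
    """|pred - true| / true at a single test point, as in the Figure 6 caption."""
    return abs(pred - true) / true

# Hypothetical stand-ins (assumptions for illustration, NOT from the paper).
def f_true(u):
    return u ** 2 / (0.25 + u ** 2)   # assumed Hill-type module function

def G(y):
    return y / (1.0 + y)              # assumed saturating composition map

def f_hat(u):
    return 1.02 * f_true(u)           # assumed estimate, 2% too large

# 100 training inputs sampled uniformly from [0, 1], as in the Figure 4 caption.
rng = random.Random(0)
u_train = [rng.uniform(0.0, 1.0) for _ in range(100)]

f_vals = [f_true(u) for u in u_train]
f_hat_vals = [f_hat(u) for u in u_train]
Y_true = [G(y) for y in f_vals]        # system outputs G(f(u_k))
Y_hat = [G(y) for y in f_hat_vals]     # predicted outputs G(f_hat(u_k))

E_f = max_relative_error(f_hat_vals, f_vals)   # module-level error
E_G = max_relative_error(Y_hat, Y_true)        # system-level error
cost = mean_square_error(Y_hat, Y_true)        # cost measured on system outputs
```

Note that the cost is evaluated only on the system outputs, while $E_f$ requires the true module function, which is available in a simulation study but not in an experiment.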

Theorems & Definitions (9)

  • Definition 1
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof