Inferring genotype-phenotype maps using attention models

Krishna Rijal, Caroline M. Holmes, Samantha Petti, Gautam Reddy, Michael M. Desai, Pankaj Mehta

TL;DR

This work addresses genotype-phenotype mapping by moving beyond additive linear models to attention-based architectures that learn context-dependent epistasis and gene-environment interactions. The authors develop a single-environment model with genotype embeddings and three stacked attention layers, plus a multi-environment extension that uses environment tokens to capture cross-environment information and enable transfer learning. Through simulations and empirical budding yeast QTL data, the attention-based approach shows superior out-of-sample predictions in epistatic regimes and the ability to transfer predictions to new environments with limited data. The approach preserves linear information while uncovering subtle interactions, offering a scalable framework for modeling complex genotype-phenotype landscapes across environments.

Abstract

Predicting phenotype from genotype is a central challenge in genetics. Traditional approaches in quantitative genetics typically analyze this problem using methods based on linear regression. These methods generally assume that the genetic architecture of complex traits can be parameterized in terms of an additive model, where the effects of loci are independent, plus (in some cases) pairwise epistatic interactions between loci. However, these models struggle to analyze more complex patterns of epistasis or subtle gene-environment interactions. Recent advances in machine learning, particularly attention-based models, offer a promising alternative. Initially developed for natural language processing, attention-based models excel at capturing context-dependent interactions and have shown exceptional performance in predicting protein structure and function. Here, we apply attention-based models to quantitative genetics. We analyze the performance of this attention-based approach in predicting phenotype from genotype using simulated data across a range of models with increasing epistatic complexity, and using experimental data from a recent quantitative trait locus mapping study in budding yeast. We find that our model demonstrates superior out-of-sample predictions in epistatic regimes compared to standard methods. We also explore a more general multi-environment attention-based model to jointly analyze genotype-phenotype maps across multiple environments and show that such architectures can be used for "transfer learning" - predicting phenotypes in novel environments with limited training data.

Paper Structure

This paper contains 32 sections, 42 equations, 11 figures.

Figures (11)

  • Figure 1: Schematic illustrating standard regression-based and attention-based methods for genotype-phenotype mapping. (a) Genotype sequences represented as vectors $\mathbf{x}^{(g)}$ for the $g$-th individual, with phenotype $y^{(g)}_\alpha$ in environment $\alpha$. (b) Classical series expansion model, combining linear and higher-order epistasis, fitted by minimizing the loss function with regularization. (c) Genotype vectors are converted to one-hot embeddings $X^{(g)}$ and transformed into $d$-dimensional embeddings $Z^{(g)}$. (d) Flowchart illustrating attention-based architecture for a case of $L=7$ loci and $d = 4$. The embeddings pass through multiple attention layers, followed by prediction. (e) The optimization process involves computing the loss $L(\mathbf{\theta})$, obtaining the gradient of the loss $\nabla_{\mathbf{\theta}} L$, and updating the parameters $\mathbf{\theta}$.
  • Figure 2: Model comparison on synthetic data. (a) Schematic of pipeline for generating simulated data. We create subsampled genotype vectors $\mathbf{x}^{(g)}$ for individual $g$ from the real genotype data in ba2022barcoded, and simulate the phenotype of this individual according to the equation shown, with coefficients drawn from Gaussian (main text) or Exponential (Appendix) distributions. (b) We define the model prediction for the effect size of the $l$-th locus in the $g$-th background genotype as $\Delta f_l^{(g)}$, which is the difference between the predicted phenotype with the $l$-th locus being 1 versus -1 in that genetic background at all other loci. (c) $R^2$ scores for linear, linear+pairwise, and attention-based models across simulated data sets as a function of the simulated strength of epistasis $\epsilon$. Left panel shows the case of $L=100$ causal loci, while middle panel shows the case of $L=300$ causal loci. Right panel shows performance at $\epsilon = 0.3$ for varying training dataset sizes. (d) For $d=30$, $\epsilon = 0.3$, and $L = 100$, the predicted effect sizes for each locus from different models are compared with the true effect sizes. Simulated data are generated using Gaussian-distributed coefficients.
  • Figure 3: Comparison of model performance in yeast QTL mapping data. We show $R^2$ on test datasets for linear, linear+pairwise, and attention-based models (with $d=12$) across 18 phenotypes (relative growth rates in various environments). For the linear+pairwise model, the causal loci inferred by ba2022barcoded are used.
  • Figure 4: Multi-environment attention-based model and performance comparison. (a) Schematic of multi-environment attention-based architecture. One-hot environmental vectors are created for each environment, combined with genotype embeddings $Z^{(g)}$, and processed through multiple attention layers to predict phenotypes $y^{(g)}_{\text{pred}}$. (b) $R^2$ performance on test datasets for linear, linear+pairwise, and attention-based model (with $d=12$) across various environments. Note that the linear and linear+pairwise models were trained separately for each environment.
  • Figure 5: Transfer learning. (a) Schematic of our approach to transfer learning. Environment-$E$ has fewer training data points than the other environments. Training data from different environments are sampled and used to train the model for phenotype prediction. (b,c) $R^2$ on the test dataset for (b) simulated data (with $d=30$) and (c) experimental yeast QTL data (with $d=12$). Along the horizontal axes, "training data" refers specifically to the number of genotypes from the new temperature $T$ included in the training set; for all other temperatures, the maximum available training set size is used. For the linear and linear+pairwise models in (b), the same number of training data points from temperature $T$ is used. Simulated data are generated using Gaussian-distributed coefficients with $\epsilon=0.3$ and $L=100$.
  • ...and 6 more figures
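
The background-dependent effect size defined in the caption of Figure 2(b), $\Delta f_l^{(g)}$, can be illustrated with a short sketch. It assumes only what the caption states: genotypes are vectors with entries $\pm 1$, and the effect of locus $l$ in background $g$ is the predicted phenotype with that locus set to 1 minus the prediction with it set to -1, all other loci held fixed. The names `effect_size` and `toy_model`, and the coefficients inside `toy_model`, are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from typing import Callable, Sequence

Genotype = Sequence[int]  # one entry per locus, each +1 or -1


def effect_size(predict: Callable[[Genotype], float],
                background: Genotype,
                locus: int) -> float:
    """Delta f_l^(g): prediction with `locus` set to +1 minus prediction
    with it set to -1, keeping every other locus of `background` fixed."""
    plus = list(background)
    minus = list(background)
    plus[locus] = 1
    minus[locus] = -1
    return predict(plus) - predict(minus)


def toy_model(x: Genotype) -> float:
    """Hypothetical stand-in for a fitted model: three additive terms plus
    one pairwise interaction between loci 0 and 1 (made-up coefficients)."""
    beta = [0.5, -1.0, 0.25]
    additive = sum(b * xi for b, xi in zip(beta, x))
    pairwise = 0.3 * x[0] * x[1]
    return additive + pairwise


if __name__ == "__main__":
    # The effect of locus 0 depends on the allele at locus 1 (epistasis),
    # while the effect of locus 2 is the same in every background.
    for background in ([1, 1, -1], [1, -1, -1]):
        print(background,
              effect_size(toy_model, background, 0),
              effect_size(toy_model, background, 2))
```

In this toy model the effect of locus 0 is $2(0.5 + 0.3\,x_1)$, so it is roughly 1.6 when locus 1 carries allele 1 and roughly 0.4 when it carries -1. A purely additive model would return the same value in every background, which is why comparing $\Delta f_l^{(g)}$ across backgrounds is a natural probe for epistasis.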