REML implementations of kernel-based genomic prediction models for genotype x environment x management interactions

Killian A. C. Melsen, Salvador Gezan, Daniel J. Tolhurst, Fred A. van Eeuwijk, Carel F. W. Peeters

TL;DR

This work provides REML-enabled, kernel-based genomic prediction models tailored for genotype-by-environment-by-management (GxExM) interactions in multi-environment trials. By implementing both linear and Gaussian kernels within standard mixed-model software and allowing environment-specific genetic variances, the authors demonstrate improved explanation of GxE variance and enhanced prediction, especially under sparse testing. The approach is validated on two real datasets (BRIWECS and DROPS), showing that nonlinear Gaussian kernels and heterogeneous variances yield higher accuracy than traditional main-effect or factor-analytic models. The framework facilitates integration of environmental covariables, phenomics, and genomics, enabling more flexible and scalable modeling of complex breeding datasets with potential extensions to multi-trait and high-throughput phenotyping data.

Abstract

High-throughput pheno-, geno-, and envirotyping allows characterization of plant genotypes and the trials they are evaluated in, producing different types of -omics data. These different data modalities can be integrated into statistical or machine learning models for genomic prediction in several ways. One commonly used approach within the analysis of multi-environment trial data in plant breeding is to create linear or nonlinear kernels which are subsequently used in linear mixed models (LMMs) to model genotype by environment (GxE) interactions. Current implementations of these kernel-based LMMs present a number of opportunities in terms of methodological extensions. Here we show how these models can be implemented in standard software, allowing direct restricted maximum likelihood (REML) estimation of all parameters. We also extend the models by combining the kernels with unstructured covariance matrices for three-way interactions in genotype by environment by management (GxExM) datasets, while simultaneously allowing for environment-specific genetic variances. We show how the models incorporating nonlinear kernels and heterogeneous variances maximize the amount of genetic variance captured by environmental covariables and perform best in prediction settings. We discuss the opportunities regarding models with multiple kernels or kernels obtained after environmental feature selection, as well as the similarities to models regressing phenotypes on latent and observed environmental covariables. Finally, we discuss the flexibility provided by our implementation in terms of modeling complex plant breeding datasets, allowing for straightforward integration of phenomics, enviromics, and genomics.

Paper Structure

This paper contains 24 sections, 23 equations, 20 figures, and 3 tables.

Figures (20)

  • Figure 1: Graphical representation of the different kernel-based G$\times$E$\times$M covariance structures. For the multiple variance models (A), the unstructured correlation matrix for managements (red) is first combined with the kernel for environments (green) and expanded to all E$\times$M combinations before multiplication with the environment-specific variances (gray). For the single variance models (B), variances are first multiplied with the unstructured correlation matrix for managements before forming the Kronecker product with the kernel for environments. Both the single and multiple variance models finally involve a second Kronecker product with the kinship matrix (blue). Note that the kernel for the environments may be linear or nonlinear.
  • Figure 2: The Gaussian kernel nonlinearly transforms squared Euclidean distances between environments to genetic correlations, with the transformation depending on the bandwidth parameter $h\in \left(0,\infty\right)$. Genetic correlations tend to $0$ for large bandwidth values, regardless of the distance, while for small bandwidth values the opposite is true. Black contour lines are placed at $0.1$ intervals.
  • Figure 3: The locations of field trials in the BRIWECS (A) and DROPS (B) datasets after subsetting to a balanced set of records. The DROPS location Graneros (Gra), Chile is not shown.
  • Figure 4: Partitioning of total variance into G$\times$E$\times$M (kernel), lack of fit, and residual effects, averaged over the $13$ environments and shown for each management of the BRIWECS dataset.
  • Figure 5: Pearson correlation and root mean squared error (RMSE) between BLUPs from the linear mixed models and centered test-set phenotypes for the BRIWECS dataset. Shaded areas represent standard errors. Accuracies are averaged over environments and shown separately for the low and high nitrogen managements. Note that as the number of checks increases, the sparsity in the sparse testing scenario decreases. SV = single variance, MV = multiple variance, LK = linear kernel, GK = Gaussian kernel, FA = factor analytic, ME = main effect.
  • ...and 15 more figures
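
The covariance structures described in the captions of Figures 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel form $\exp(-h\,d^2)$ is chosen only because it matches the behaviour described for Figure 2 (correlations tend to $0$ as $h$ grows), the ordering of the Kronecker factors is an assumption, and the multiple variance model is assumed to carry one variance per E$\times$M combination. All function names and the toy inputs are hypothetical.

```python
import numpy as np


def gaussian_kernel(W, h):
    """Gaussian kernel between environments.

    W is an (environments x covariables) matrix. The assumed form is
    K[i, j] = exp(-h * ||w_i - w_j||^2), so K tends to the identity for large h.
    """
    diff = W[:, None, :] - W[None, :, :]   # pairwise differences between rows
    d2 = np.sum(diff ** 2, axis=-1)        # squared Euclidean distances
    return np.exp(-h * d2)


def single_variance_cov(K_env, R_man, var_man, G):
    """Single variance structure (Figure 1B, as read from the caption).

    Management variances scale the management correlation matrix first; the
    result is combined with the environment kernel and then with kinship.
    """
    s = np.sqrt(var_man)
    V_man = R_man * np.outer(s, s)         # management covariance matrix
    return np.kron(np.kron(K_env, V_man), G)


def multiple_variance_cov(K_env, R_man, var_em, G):
    """Multiple variance structure (Figure 1A, as read from the caption).

    The kernel and the management correlations are expanded to all E x M
    combinations first, then scaled by one variance per combination (assumed).
    """
    C_em = np.kron(K_env, R_man)           # correlations over E x M combinations
    s = np.sqrt(var_em)
    V_em = C_em * np.outer(s, s)           # apply heterogeneous variances
    return np.kron(V_em, G)


# Toy inputs (hypothetical): 3 environments, 2 covariables, 2 managements, 4 genotypes.
W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel(W, h=0.5)
R_man = np.array([[1.0, 0.8], [0.8, 1.0]])
var_man = np.array([2.0, 3.0])
var_em = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
G = np.eye(4)                              # identity stands in for a kinship matrix

cov_sv = single_variance_cov(K, R_man, var_man, G)
cov_mv = multiple_variance_cov(K, R_man, var_em, G)

# With the same management variances repeated in every environment,
# the multiple variance structure reduces to the single variance one.
cov_mv_equal = multiple_variance_cov(K, R_man, np.tile(var_man, 3), G)
```

Under these assumptions the two structures differ only in where the variances enter: the single variance model scales the management block before the Kronecker product with the kernel, while the multiple variance model scales the fully expanded E$\times$M correlation matrix. Both resulting matrices have one row and column per genotype in each E$\times$M combination.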