Disentangling Genotype and Environment Specific Latent Features for Improved Trait Prediction using a Compositional Autoencoder

Anirudha Powadi, Talukder Zaki Jubery, Michael C. Tross, James C. Schnable, Baskar Ganapathysubramanian

TL;DR

A compositional autoencoder decomposes high-dimensional phenotype data into distinct genotype-specific and environment-specific latent features. It outperforms PCA, PLSR, and vanilla autoencoders on trait prediction, supporting applications in agricultural and biological sciences.

Abstract

This study introduces a compositional autoencoder (CAE) framework designed to disentangle the complex interplay between genotypic and environmental factors in high-dimensional phenotype data, with the aim of improving trait prediction in plant breeding and genetics programs. Traditional predictive methods rely on compact representations of high-dimensional data, built either from handcrafted features or from latent features produced by PCA or, more recently, autoencoders. These representations do not separate genotype-specific from environment-specific factors. We hypothesize that disentangling the latent features into genotype-specific and environment-specific components can enhance predictive models. To test this, we developed the CAE, which decomposes high-dimensional data into distinct genotype-specific and environment-specific latent features. The framework employs a hierarchical architecture within an autoencoder to separate these entangled latent features. Applied to a maize diversity panel dataset, the CAE models environmental influences better and achieves 5- to 10-fold better predictive performance for key traits such as Days to Pollen and Yield than traditional methods, including standard autoencoders, PCA with regression, and Partial Least Squares Regression (PLSR). By disentangling latent features, the CAE provides a powerful tool for precision breeding and genetic research. This work significantly enhances trait prediction models, advancing agricultural and biological sciences.
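
The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how the hierarchical architecture composes latents, so the sketch assumes one plausible scheme in which the genotype part of the latent vector is averaged over all observations of a genotype and the environment part is averaged over all observations in an environment. The encoder and decoder are untrained random linear maps, the input is synthetic, and the latent sizes (`d_geno`, `d_env`, `d_plant`) and function names are hypothetical. Only the dataset dimensions (578 genotypes, two environments, two replicates, 2151 wavelengths) come from the paper's figure captions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset dimensions from the figure captions; latent sizes are hypothetical.
n_genotypes, n_envs, n_reps, n_wavelengths = 578, 2, 2, 2151
d_geno, d_env, d_plant = 8, 4, 4
d_latent = d_geno + d_env + d_plant

# Synthetic stand-in for spectra, indexed (genotype, environment, replicate, wavelength).
x = rng.random((n_genotypes, n_envs, n_reps, n_wavelengths))

# Untrained linear encoder/decoder weights (assumption: shape illustration only).
w_enc = rng.normal(scale=0.01, size=(n_wavelengths, d_latent))
w_dec = rng.normal(scale=0.01, size=(d_latent, n_wavelengths))


def compose_latents(x):
    """Encode every observation, then share latent blocks by group (assumed scheme)."""
    z = x @ w_enc  # shape (G, E, R, d_latent)
    z_geno, z_env, z_plant = np.split(z, [d_geno, d_geno + d_env], axis=-1)

    # Genotype block: one vector per genotype, averaged over environments and replicates.
    z_geno = z_geno.mean(axis=(1, 2), keepdims=True)  # (G, 1, 1, d_geno)
    # Environment block: one vector per environment, averaged over genotypes and replicates.
    z_env = z_env.mean(axis=(0, 2), keepdims=True)  # (1, E, 1, d_env)

    # Broadcast the shared blocks back to every observation and re-assemble the latent.
    obs_shape = z_plant.shape[:-1]
    z_composed = np.concatenate(
        [
            np.broadcast_to(z_geno, obs_shape + (d_geno,)),
            np.broadcast_to(z_env, obs_shape + (d_env,)),
            z_plant,  # plant-specific block stays per observation
        ],
        axis=-1,
    )
    return z_geno[:, 0, 0, :], z_env[0, :, 0, :], z_composed


def reconstruct(z_composed):
    """Decode the composed latent back to spectra."""
    return z_composed @ w_dec


geno_features, env_features, z_composed = compose_latents(x)
x_hat = reconstruct(z_composed)
reconstruction_mse = float(np.mean((x - x_hat) ** 2))
```

In a trained model, the reconstruction error would drive learning of nonlinear encoder and decoder networks, and the per-genotype rows of `geno_features` would be the candidate inputs to a downstream trait regressor. Here the sketch only shows how a single latent vector can be partitioned so that some blocks are forced to be shared within a genotype or within an environment.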

Paper Structure

This paper contains 27 sections, 10 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: The problem definition: Extract and disentangle the effects of genotype and environment for a given type of sensor data, assuming multiple observations of each genotype in each environment. Our specific test dataset was a set of hyperspectral reflectance data collected from 578 distinct genotypes of maize in two distinct environments, with two replicates of each genotype in each environment (four total replicates per genotype and 1,156 total replicates per environment).
  • Figure 2: The goal of this research is to disentangle the input hyperspectral data into genotype-specific, environment-specific, and plant-specific information. This is achieved by the method of composition.
  • Figure 3: Trait prediction workflow of a Vanilla Autoencoder vs. a Compositional Autoencoder.
  • Figure 4: Hyperspectral leaf reflectance data were collected using a FieldSpec4 (Malvern Panalytical Ltd., formerly Analytical Spectral Devices) with a contact probe. A total of 2151 wavelengths were collected, ranging from 350 nm to 2500 nm. The dataset consists of measurements for a set of 578 different maize inbred genotypes that were grown and phenotyped in two different environments, with two replicates of each genotype per environment.
  • Figure 5: A vanilla autoencoder works to learn a compressed yet highly informative representation of the input data.
  • ...and 6 more figures