G2PDiffusion: Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion

Mengdi Liu, Zhangyang Gao, Hong Chang, Stan Z. Li, Shiguang Shan, Xilin Chen

TL;DR

G2PDiffusion reframes cross-species genotype-to-phenotype prediction as conditional image synthesis by generating morphological phenotypes $X$ from genotype $G$ and environment $E$ using a diffusion model. It introduces three innovations: an evolution-aware MSA conditioning pipeline with an MMseqs2-based retrieval engine, an environment-aware MSA encoder, and a dynamic phenomic alignment mechanism to preserve genotype–phenotype consistency during sampling. The approach achieves superior performance on the BIOSCAN-5M dataset compared to baselines, demonstrating strong cross-species generalization and the value of integrating evolutionary and environmental signals. This work provides a promising pathway for AI-assisted genomic analysis, enabling scalable, biologically plausible phenotype prediction across diverse species.

Abstract

Understanding how genes influence phenotype across species is a fundamental challenge in genetic engineering, and progress on it would facilitate advances in fields such as crop breeding, conservation biology, and personalized medicine. However, current phenotype prediction models are limited to individual species and depend on an expensive phenotype labeling process, making genotype-to-phenotype prediction a highly domain-dependent and data-scarce problem. To this end, we suggest taking images as morphological proxies, facilitating cross-species generalization through large-scale multimodal pretraining. We propose the first genotype-to-phenotype diffusion model (G2PDiffusion), which generates morphological images from DNA while considering two critical evolutionary signals: multiple sequence alignments (MSA) and environmental contexts. The model contains three novel components: 1) an MSA retrieval engine that identifies conserved and co-evolutionary patterns; 2) an environment-aware MSA conditional encoder that models complex genotype-environment interactions; and 3) an adaptive phenomic alignment module that improves genotype-phenotype consistency. Extensive experiments show that integrating evolutionary signals with environmental context enriches the model's understanding of phenotype variability across species, offering a promising exploration into AI-assisted genomic analysis.
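To make the notion of "conserved patterns" in an MSA concrete, the sketch below scores each alignment column by how little its characters vary. It is an illustration only, not the paper's retrieval engine: the function name `column_conservation`, the toy alignment, and the entropy-based score are all assumptions introduced here, and the input is assumed to contain only A, C, G, T and the gap character `-`.

```python
from collections import Counter
from math import log2


def column_conservation(msa, alphabet_size=4):
    """Return one conservation score in [0, 1] per alignment column.

    Hypothetical helper for illustration. The score is
    1 - H / log2(alphabet_size), where H is the Shannon entropy of the
    non-gap characters in the column. Columns made only of gaps score 0.
    Assumes the sequences use only A, C, G, T and '-'.
    """
    if not msa:
        return []
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for col in range(length):
        # Count the non-gap characters observed in this column.
        counts = Counter(row[col] for row in msa if row[col] != "-")
        total = sum(counts.values())
        if total == 0:
            scores.append(0.0)
            continue
        entropy = -sum((n / total) * log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / log2(alphabet_size))
    return scores


# Toy alignment: the first two columns are perfectly conserved.
toy_msa = ["ACGT", "ACGA", "ACTA", "AC-C"]
scores = column_conservation(toy_msa)
```

In this toy alignment the first two columns are identical across all sequences and so receive the maximum score, while the last two columns vary and score lower. Co-evolutionary patterns, which involve dependencies between pairs of columns, are not captured by this per-column measure.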

Paper Structure

This paper contains 33 sections, 14 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Ground truth images (top row) and generated images conditioning on DNA (bottom row).
  • Figure 2: G2PDiffusion generates morphological images using an advanced diffusion model and large-scale cross-species data.
  • Figure 3: G2PDiffusion for genotype-to-phenotype image synthesis. It first uses MMseqs2 to retrieve evolutionary alignments (Section \ref{sec:Constructing}). The retrieved MSAs are then fed into an environment-enhanced MSA conditioner that integrates them with environmental factors, i.e., longitude and latitude (Section \ref{sec:G2P}). Additionally, a cross-modality alignment guidance mechanism enforces genotype-phenotype consistency during sampling (Section \ref{sec:Alignment}).
  • Figure 4: Density Distribution of DNA-Image CLIBDScore.
  • Figure 5: Generative results. All methods generate visually reasonable images but differ in DNA-image consistency.
  • ...and 1 more figure
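The Figure 3 caption states that longitude and latitude are integrated with the retrieved MSA, but this summary does not say how the coordinates are encoded. As a hedged illustration, the sketch below shows one common choice: mapping each coordinate to sine and cosine features at several frequencies, so that nearby locations receive similar vectors and longitude wraps around at ±180°. The function name `encode_location` and the parameter `num_frequencies` are hypothetical and are not taken from the paper.

```python
import math


def encode_location(lat_deg, lon_deg, num_frequencies=4):
    """Encode latitude/longitude (degrees) as periodic features.

    Hypothetical helper for illustration; the paper's actual encoding is
    not given in this summary. Returns 4 * num_frequencies floats:
    for each frequency 2**k, the values sin and cos of the scaled
    latitude, then sin and cos of the scaled longitude.
    """
    if not -90.0 <= lat_deg <= 90.0:
        raise ValueError("latitude must be within [-90, 90] degrees")
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError("longitude must be within [-180, 180] degrees")

    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    features = []
    for k in range(num_frequencies):
        freq = 2 ** k
        for angle in (lat, lon):
            features.append(math.sin(freq * angle))
            features.append(math.cos(freq * angle))
    return features


# Example: a 16-dimensional feature vector for one sampling location.
location_features = encode_location(45.0, -75.0)
```

A vector like this could be concatenated with, or added to, a sequence embedding before conditioning the generator; which of these the environment-aware encoder actually does is not specified here.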