G2PDiffusion: Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion
Mengdi Liu, Zhangyang Gao, Hong Chang, Stan Z. Li, Shiguang Shan, Xilin Chen
TL;DR
G2PDiffusion reframes cross-species genotype-to-phenotype prediction as conditional image synthesis by generating morphological phenotypes $X$ from genotype $G$ and environment $E$ using a diffusion model. It introduces three innovations: an evolution-aware MSA conditioning pipeline with an MMseqs2-based retrieval engine, an environment-aware MSA encoder, and a dynamic phenomic alignment mechanism to preserve genotype–phenotype consistency during sampling. The approach achieves superior performance on the BIOSCAN-5M dataset compared to baselines, demonstrating strong cross-species generalization and the value of integrating evolutionary and environmental signals. This work provides a promising pathway for AI-assisted genomic analysis, enabling scalable, biologically plausible phenotype prediction across diverse species.
Abstract
Understanding how genes influence phenotype across species is a fundamental challenge in genetic engineering, and solving it would facilitate advances in fields such as crop breeding, conservation biology, and personalized medicine. However, current phenotype prediction models are limited to individual species and depend on an expensive phenotype labeling process, making genotype-to-phenotype prediction a highly domain-dependent and data-scarce problem. To this end, we propose using images as morphological proxies, which facilitates cross-species generalization through large-scale multimodal pretraining. We propose the first genotype-to-phenotype diffusion model (G2PDiffusion), which generates morphological images from DNA while considering two critical evolutionary signals: multiple sequence alignments (MSA) and environmental contexts. The model contains three novel components: 1) an MSA retrieval engine that identifies conserved and co-evolutionary patterns; 2) an environment-aware MSA conditional encoder that effectively models complex genotype-environment interactions; and 3) an adaptive phenomic alignment module that improves genotype-phenotype consistency. Extensive experiments show that integrating evolutionary signals with environmental context enriches the model's understanding of phenotype variability across species, offering a promising exploration into advanced AI-assisted genomic analysis.
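To make the notion of "conserved patterns" in an MSA concrete, the following is a minimal illustrative sketch and not the G2PDiffusion retrieval engine or encoder. It assumes a toy alignment of equal-length DNA strings over the alphabet A, C, G, T with `-` as the gap character, and scores each column by one minus its normalized Shannon entropy. The function name `column_conservation` and the toy data are hypothetical.

```python
import math
from collections import Counter


def column_conservation(msa):
    """Return one conservation score per alignment column.

    Score = 1 - H / 2, where H is the Shannon entropy (in bits) of the
    nucleotide distribution in the column and 2 = log2(4) is the maximum
    entropy for the assumed alphabet A, C, G, T. Gaps ('-') are ignored.
    """
    if not msa:
        return []
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for column in zip(*msa):  # iterate over alignment columns
        counts = Counter(base for base in column if base != "-")
        total = sum(counts.values())
        if total == 0:
            scores.append(0.0)  # gap-only column: no evidence of conservation
            continue
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / 2.0)
    return scores


# Hypothetical toy alignment: columns 1-2 fully conserved, column 4 maximally variable.
toy_msa = [
    "ACGT",
    "ACGA",
    "ACTC",
    "AC-G",
]
scores = column_conservation(toy_msa)
print([round(s, 3) for s in scores])
```

Columns with scores near 1 are conserved across the aligned sequences, while scores near 0 indicate high variability. Capturing co-evolutionary patterns would additionally require pairwise statistics between columns, which this sketch does not attempt.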
