DPCformer: An Interpretable Deep Learning Model for Genomic Prediction in Crops

Pengcheng Deng, Kening Liu, Mengxi Zhou, Mingxi Li, Rui Yang, Chuzhe Cao, Maojun Wang, Zeyu Zhang

TL;DR

DPCformer targets the core challenges of genomic selection in crops by combining CNN-based local SNP feature extraction with a multi-head self-attention Transformer that models both intra- and inter-chromosomal genotype-phenotype relationships. It introduces an eight-dimensional SNP encoding, MIC-based feature selection, MAP-file chromosome segmentation, and a polyploid-aware processing pipeline for cotton, which together support robust prediction even with small samples. The architecture pairs a per-chromosome Res-CNN with a cross-chromosome Transformer and an MLP predictor. It is trained with an MSE loss and validated with 10-fold cross-validation across five crops and multiple traits, where it reports state-of-the-art accuracy; SHAP analyses add interpretability by highlighting biologically plausible candidate genes. These contributions offer a scalable, interpretable framework for precision breeding, with the potential to accelerate genetic gains and strengthen global food security.
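
The 10-fold cross-validation protocol mentioned above can be illustrated with a small, dependency-free sketch. It scores each held-out fold with the Pearson correlation coefficient, the metric the abstract reports for tomato and chickpea. Everything here is an assumption made for illustration: the function names (`pearson`, `kfold_indices`, `cross_validate`), the shuffling seed, and the toy linear model that stands in for DPCformer are hypothetical and not taken from the paper.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Mean held-out Pearson correlation over k folds."""
    scores = []
    for test_idx in kfold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        scores.append(pearson(preds, [y[i] for i in test_idx]))
    return sum(scores) / k

# Hypothetical stand-in model: least-squares line on the sum of each feature row.
def fit_linear(X, y):
    xs = [sum(row) for row in X]
    n = len(xs)
    mx, my = sum(xs) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, y))
             / sum((a - mx) ** 2 for a in xs))
    return slope, my - slope * mx

def predict_linear(model, X):
    slope, intercept = model
    return [slope * sum(row) + intercept for row in X]
```

Any model exposing the same `fit` and `predict` callables could be dropped into `cross_validate`. Each fold needs at least two samples with distinct predictions, otherwise the correlation is undefined.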

Abstract

Genomic Selection (GS) uses whole-genome information to predict crop phenotypes and accelerate breeding. Traditional GS methods, however, struggle with prediction accuracy for complex traits and large datasets. We propose DPCformer, a deep learning model integrating convolutional neural networks with a self-attention mechanism to model complex genotype-phenotype relationships. We applied DPCformer to 13 traits across five crops (maize, cotton, tomato, rice, chickpea). Our approach uses an 8-dimensional one-hot encoding for SNP data, ordered by chromosome, and employs the PMF algorithm for feature selection. Evaluations show DPCformer outperforms existing methods. In maize datasets, accuracy for traits like days to tasseling and plant height improved by up to 2.92%. For cotton, accuracy gains for fiber traits reached 8.37%. On small-sample tomato data, the Pearson Correlation Coefficient for a key trait increased by up to 57.35%. In chickpea, the yield correlation was boosted by 16.62%. DPCformer demonstrates superior accuracy, robustness in small-sample scenarios, and enhanced interpretability, providing a powerful tool for precision breeding and addressing global food security challenges.
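
As a rough illustration of the encoding step, the sketch below shows one plausible way to build an 8-dimensional vector per SNP and keep SNPs ordered by chromosome. The abstract states only the dimensionality and the chromosome ordering, so the layout used here (two alleles, each a 4-dimensional one-hot block over an assumed base order `ACGT`) is an assumption, as are the function names and the `(chromosome, position, genotype)` input format.

```python
BASES = "ACGT"  # assumed base order; the source does not specify one

def encode_genotype(genotype):
    """Encode a two-allele genotype such as 'AG' as an 8-dim vector.

    Assumed layout: slots 0-3 one-hot encode the first allele and
    slots 4-7 the second. Unknown alleles (e.g. 'N') stay all-zero.
    """
    vec = [0] * 8
    for slot, allele in enumerate(genotype[:2]):
        pos = BASES.find(allele.upper())
        if pos >= 0:
            vec[4 * slot + pos] = 1
    return vec

def group_by_chromosome(snps):
    """Group encoded SNPs per chromosome, ordered by position.

    snps: iterable of (chromosome, position, genotype) tuples.
    Returns {chromosome: [8-dim vector, ...]}.
    """
    by_chrom = {}
    for chrom, _pos, genotype in sorted(snps, key=lambda s: (s[0], s[1])):
        by_chrom.setdefault(chrom, []).append(encode_genotype(genotype))
    return by_chrom
```

For example, `encode_genotype("AG")` gives `[1, 0, 0, 0, 0, 0, 1, 0]` under this assumed layout. The per-chromosome lists are the kind of input a per-chromosome feature extractor could consume separately.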

Paper Structure

This paper contains 21 sections, 7 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: (A) The workflow of DPCformer in crop genomic prediction from SNPs. (B) The limitations of existing methods.
  • Figure 2: The DPCformer model consists mainly of a CNN layer and a multi-head self-attention layer. The CNN layer captures local signals among SNPs, while multi-head self-attention lets the model focus on the most important SNPs.
  • Figure 3: Prediction accuracy of five different models on five datasets.
  • Figure 4: Top 20 key genes screened based on the plant height (PH) trait in maize.
  • Figure 5: Top 20 significant SNPs obtained after calculating SHAP values based on the ear weight (EW) trait in maize.