CSGDN: Contrastive Signed Graph Diffusion Network for Predicting Crop Gene-phenotype Associations

Yiru Pan, Xingyu Ji, Jiaqi You, Lu Li, Zhenping Liu, Xianlong Zhang, Zeyu Zhang, Maojun Wang

TL;DR

CSGDN tackles the challenge of predicting positive/negative gene–phenotype associations under data scarcity and noise by modeling associations as a signed bipartite graph and applying signed graph diffusion to enrich structure. It then uses four augmented views built via graph diffusion and random edge masking, coupled with two GAT-based encoders for the positive and negative edge types, and a multi-view contrastive objective to learn robust node embeddings with limited supervision. A supplementary MLP enables encoding of TWAS-missing genes, and a joint loss combines contrastive regularization with a standard sign-prediction loss. Experiments on three crops show CSGDN outperforms unsigned and signed baselines, with strong robustness to small samples and edge perturbations, highlighting practical utility for crop genomics and gene–phenotype discovery.
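
The signed bipartite graph and the random edge masking mentioned above can be illustrated with a small standard-library sketch. This is not the authors' implementation: the gene and phenotype names, the drop probability, and the helper names `mask_edges` and `split_by_sign` are all assumptions made for illustration. The sketch covers only the masking step; in CSGDN the same kind of perturbation would also be applied to the diffused graph, which is not shown here.

```python
import random

# Signed bipartite graph: (gene, phenotype) -> +1 (positive) or -1 (negative).
# All node names are hypothetical placeholders, not data from the paper.
edges = {
    ("gene_a", "fiber_length"): +1,
    ("gene_a", "fiber_strength"): -1,
    ("gene_b", "fiber_length"): -1,
    ("gene_c", "fiber_strength"): +1,
    ("gene_c", "fiber_length"): +1,
}


def mask_edges(signed_edges, drop_prob, rng):
    """Drop each edge independently with probability drop_prob; signs are kept."""
    return {e: s for e, s in signed_edges.items() if rng.random() >= drop_prob}


def split_by_sign(signed_edges):
    """Separate edges into positive and negative lists, one per edge type."""
    pos = sorted(e for e, s in signed_edges.items() if s > 0)
    neg = sorted(e for e, s in signed_edges.items() if s < 0)
    return pos, neg


rng = random.Random(0)  # fixed seed so the two views are reproducible
view_1 = mask_edges(edges, 0.2, rng)
view_2 = mask_edges(edges, 0.2, rng)

# Each view is split by sign so that positive and negative edges
# can be fed to separate encoders.
pos_1, neg_1 = split_by_sign(view_1)
pos_2, neg_2 = split_by_sign(view_2)
print(len(view_1), len(view_2))
```

Splitting each view by sign mirrors the description of two encoders, one per edge type; how the two resulting embeddings are combined is left open here.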

Abstract

Predicting positive and negative associations between genes and phenotypes helps to illuminate the mechanisms underlying complex traits in organisms. The transcription and regulatory activity of specific genes is adjusted across cell types, developmental stages, and physiological states. Two problems arise in obtaining positive/negative associations between genes and traits: 1) high-throughput DNA/RNA sequencing and phenotyping are expensive and time-consuming because large sample sizes must be processed; 2) experiments introduce both random and systematic errors, and calculations or predictions made with software or models may add further noise. To address these two issues, we propose a Contrastive Signed Graph Diffusion Network, CSGDN, which learns robust node representations from fewer training samples to achieve higher link prediction accuracy. CSGDN employs a signed graph diffusion method to uncover the underlying regulatory associations between genes and phenotypes. Then, stochastic perturbation strategies are used to create two views each of the original and diffused graphs. Lastly, a multi-view contrastive loss is designed to unify the node representations learned from the resulting views, so as to resist interference and reduce noise. We conduct experiments to validate the performance of CSGDN on three crop datasets: Gossypium hirsutum, Brassica napus, and Triticum turgidum. The results demonstrate that the proposed model outperforms state-of-the-art methods by up to 9.28% AUC for link sign prediction on the G. hirsutum dataset.
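
To make the idea of a contrastive objective across views concrete, the following sketch computes an InfoNCE-style loss between two sets of node embeddings. InfoNCE is a common choice for this kind of objective, but the abstract does not state the exact loss used in CSGDN, so the formulation, the temperature value, and the toy embeddings below are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss over nodes.

    Node i in view_a is pulled toward node i in view_b (the positive pair)
    and pushed away from every other node in view_b (the negatives).
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(view_a)


# Toy 2-D embeddings for three nodes (hypothetical values).
emb_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
emb_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]       # agrees with emb_a
emb_b_shuffled = [emb_b[1], emb_b[2], emb_b[0]]      # node identities mismatched

loss_aligned = info_nce(emb_a, emb_b)
loss_shuffled = info_nce(emb_a, emb_b_shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is lower when the same node has similar embeddings in both views, which is the property a multi-view objective relies on to make representations robust to perturbed edges.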

Paper Structure

This paper contains 23 sections, 16 equations, 4 figures, 6 tables, 2 algorithms.

Figures (4)

  • Figure 1: CSGDN abstracts the associations between genes and phenotypes into a signed bipartite graph. Our task is to predict the gene-phenotype associations by constructing a neural network framework for the bipartite graph.
  • Figure 2: The overall architecture of CSGDN.
  • Figure 3: The framework for genes that cannot be associated with phenotypes.
  • Figure 4: Hyperparameter sensitivity of CSGDN on the G. hirsutum dataset.