Interpreting artificial neural networks to detect genome-wide association signals for complex traits

Burak Yelmen; Maris Alver; Merve Nur Güler; Estonian Biobank Research Team; Flora Jay; Lili Milani


TL;DR

The paper addresses the challenge of deciphering the genetic architecture of complex diseases beyond linear GWAS by training neural networks to predict phenotypes from genome-wide SNP data. It introduces a general interpretability framework that uses three post hoc methods, Saliency Maps (SM), Integrated Gradients (IG), and Permutation feature importance (PM), to compute mean attribution scores (MAS), derive adjusted MAS (AMAS), and identify potentially associated loci (PAL). P-values for PAL are estimated from null distributions obtained with permuted labels, and strict and relaxed $\theta$ thresholds are applied at the 99.99th and 99.95th percentiles, respectively. Simulations show that IG maintains better false positive control under the strict threshold. In the Estonian Biobank schizophrenia (SCZ) application, the approach detects multiple PAL, including a lead SNP on chromosome 11 with a p-value around $5 \times 10^{-8}$. The detected PAL are enriched for brain-related genes and overlap with known SCZ and bipolar disorder loci. The study discusses limitations in uncertainty quantification and LD confounding, but suggests interpretable neural networks as a promising screening tool to augment GWAS and guide functional follow-up.
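To make the attribution step more concrete, the following is a minimal sketch of integrated gradients, using the standard definition $IG_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F(x' + \alpha (x - x'))}{\partial x_i} d\alpha$ approximated by a midpoint Riemann sum. It is not the authors' implementation: the toy logistic model stands in for a trained network, and the all-zero baseline, the genotype coding (0/1/2), the use of absolute values when averaging, and all function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, weights, bias):
    """Toy stand-in for a trained network (assumption): logistic model on genotypes."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def model_gradient(x, weights, bias):
    """Analytic gradient of the toy model output with respect to each input."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Midpoint Riemann-sum approximation of IG along the straight path baseline -> x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(model_gradient(point, weights, bias)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def mean_attribution_scores(samples, baseline, weights, bias):
    """Average absolute attribution per SNP over samples (absolute value is an assumption)."""
    sums = [0.0] * len(baseline)
    for x in samples:
        for i, a in enumerate(integrated_gradients(x, baseline, weights, bias)):
            sums[i] += abs(a)
    return [s / len(samples) for s in sums]


if __name__ == "__main__":
    weights, bias = [0.8, 0.0, -0.5], 0.0   # hypothetical per-SNP effects
    baseline = [0.0, 0.0, 0.0]              # assumed all-reference baseline
    samples = [[2, 1, 1], [0, 2, 1], [1, 0, 2]]
    print(mean_attribution_scores(samples, baseline, weights, bias))
```

A useful sanity check for any such implementation is the completeness property: for a single sample, the attributions should sum to the difference between the model output at the sample and at the baseline.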

Abstract

Investigating the genetic architecture of complex diseases is challenging due to the multifactorial and interactive landscape of genomic and environmental influences. Although genome-wide association studies (GWAS) have identified thousands of variants for multiple complex traits, conventional statistical approaches can be limited by simplified assumptions such as linearity and lack of epistasis in models. In this work, we trained artificial neural networks to predict complex traits using both simulated and real genotype-phenotype datasets. We extracted feature importance scores via different post hoc interpretability methods to identify potentially associated loci (PAL) for the target phenotype and devised an approach for obtaining p-values for the detected PAL. Simulations with various parameters demonstrated that associated loci can be detected with good precision using strict selection criteria. By applying our approach to the schizophrenia cohort in the Estonian Biobank, we detected multiple loci associated with this highly polygenic and heritable disorder. There was significant concordance between PAL and loci previously associated with schizophrenia and bipolar disorder, with enrichment analyses of genes within the identified PAL predominantly highlighting terms related to brain morphology and function. With advancements in model optimization and uncertainty quantification, artificial neural networks have the potential to enhance the identification of genomic loci associated with complex diseases, offering a more comprehensive approach for GWAS and serving as initial screening tools for subsequent functional studies.
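As a rough illustration of the selection and p-value steps, the sketch below picks candidate SNPs whose score exceeds a high percentile of the observed score distribution and assigns each an empirical p-value against scores from a permuted-label null. This is one plausible reading of the summary, not the paper's exact procedure: whether the percentile is taken over observed or null scores, how the null is pooled, and the add-one correction are all assumptions, and the function names are hypothetical.

```python
import random


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def empirical_p_value(observed, null_scores):
    """Share of null scores at least as large as the observed one (add-one corrected)."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def select_candidates(scores, null_scores, q=99.99):
    """Return (index, score, p-value) for SNPs scoring above the q-th percentile."""
    theta = percentile(scores, q)
    return [
        (i, s, empirical_p_value(s, null_scores))
        for i, s in enumerate(scores)
        if s > theta
    ]


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical scores: noise everywhere, one strong signal at index 1234.
    null_scores = [abs(random.gauss(0, 1)) for _ in range(10000)]
    scores = [abs(random.gauss(0, 1)) for _ in range(10000)]
    scores[1234] = 12.0
    for idx, score, p in select_candidates(scores, null_scores, q=99.99):
        print(idx, score, p)
```

Note that the smallest p-value such an empirical estimate can return is bounded by the number of null scores, so reaching genome-wide significance levels this way requires either a very large null or a parametric fit to it.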


Paper Structure (1 section, 15 equations, 3 figures, 1 table, 1 algorithm)



Figures (3)

Figure 1: Overview of the approach for obtaining potentially associated loci (PAL) from feature attribution scores obtained via SM, IG and PM methods.
Figure 2: Comparison of true positive (TP) and false positive (FP) counts for the integrated gradients (IG) and logistic regression (LR) methods across simulation scenarios (n: noise factor, c: number of causal positions, t: relaxed $\theta$ threshold) and p-value significance thresholds. A signal was counted as TP if at least one SNP in a detected PAL block (i.e., a block formed by clumping detected SNPs less than $100$ kb apart) was within $\pm100$ kb (approximately $\pm20$ SNPs) of a causal position. Values above 0 on the y-axis indicate TP counts, whereas values below 0 indicate FP counts.
Figure 3: PAL detected by the integrated gradients (IG) approach. Red dashed lines indicate the relaxed and strict $\theta$ thresholds, and blue markers indicate PAL above the threshold across all 10 trained models (i.e., $PAL_{Common}$). For PAL (a-g), the closest protein-coding gene within $\pm100$ kb of the lead SNP is shown.
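
The TP/FP rule described in the Figure 2 caption can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes a single chromosome, base-pair positions as integers, and chain-style clumping in which a SNP joins the current block when it lies less than 100 kb from the previous SNP in that block. The function names are hypothetical.

```python
def clump_positions(positions, max_gap=100_000):
    """Group positions into blocks; consecutive SNPs closer than max_gap share a block."""
    blocks = []
    for pos in sorted(positions):
        if blocks and pos - blocks[-1][-1] < max_gap:
            blocks[-1].append(pos)
        else:
            blocks.append([pos])
    return blocks


def count_tp_fp(blocks, causal_positions, window=100_000):
    """A block is TP if any of its SNPs lies within +/- window of a causal position."""
    tp = fp = 0
    for block in blocks:
        near_causal = any(
            abs(pos - causal) <= window
            for pos in block
            for causal in causal_positions
        )
        if near_causal:
            tp += 1
        else:
            fp += 1
    return tp, fp


if __name__ == "__main__":
    # Hypothetical detected SNP positions and simulated causal positions (bp).
    detected = [1_000_000, 1_050_000, 1_120_000, 5_000_000, 9_000_000]
    causal = [1_180_000, 7_000_000]
    blocks = clump_positions(detected)
    print(blocks)
    print(count_tp_fp(blocks, causal))
```

Counting per block rather than per SNP keeps a single strong locus with many correlated SNPs from inflating the TP count.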
