Commutative algebra neural network reveals genetic origins of diseases

JunJie Wee, Faisal Suwayyid, Mushal Zia, Hongsong Feng, Yuta Hozumi, Guo-Wei Wei

TL;DR

A unified framework based on multiscale commutative algebra that, for the first time, captures intrinsic physical and chemical interactions and offers multiscale, mechanistic, interpretable, and generalizable models for predicting disease-mutation associations.

Abstract

Genetic mutations can disrupt protein structure, stability, and solubility, contributing to a wide range of diseases. Existing predictive models often lack interpretability and fail to integrate physical and chemical interactions critical to molecular mechanisms. Moreover, current approaches treat disease association, stability changes, and solubility alterations as separate tasks, limiting model generalizability. In this study, we introduce a unified framework based on multiscale commutative algebra to capture intrinsic physical and chemical interactions for the first time. Leveraging Persistent Stanley-Reisner Theory, we extract multiscale algebraic invariants to build a Commutative Algebra neural Network (CANet). We integrate CANet with transformer features and auxiliary physical features and apply it to three key domains for the first time: disease-associated mutations, mutation-induced protein stability changes, and solubility changes upon mutations. Across six benchmark tasks, CANet and its gradient boosting tree counterpart, CATree, consistently attain state-of-the-art performance, achieving up to 7.5% improvement in predictive accuracy. Our approach offers multiscale, mechanistic, interpretable, and generalizable models for predicting disease-mutation associations.

Paper Structure

This paper contains 22 sections, 5 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Illustration of the commutative algebra neural network (CANet) workflow. a. 3D protein structures are obtained from the PDB. Mutant proteins are generated with the Jackal software [xiang2002jackal]. b. Atom subsets of the mutation site and its local neighborhood are extracted from both the wild-type and mutant structures to form element-specific subcomplexes. Multiscale commutative algebra embedding is performed to generate the persistent facet ideals and persistent $f$-vector curves. c. Auxiliary features such as surface area and secondary structure, together with ESM-2 transformer-based features, are also generated. d. Commutative algebra features are concatenated with auxiliary and ESM-2 transformer-based features to form a long feature vector. Features are then fed into the downstream CANet model. The hyperparameters of CANet are optimized. Colors of dotted frames and arrows indicate workflows in different modules: a. and b. Commutative algebra-based module (blue), c. Auxiliary and ESM-2 transformer-based module (orange), d. CANet module (purple).
  • Figure 2: Illustration of commutative algebra model performance in predicting disease-related mutations, protein stability changes, and protein solubility changes upon mutation. a. Blind test performance of CATree compared with existing state-of-the-art models [pires2020mcsm] in predicting disease-associated mutations. b. 10-fold cross-validation performance of CATree in predicting disease-associated mutations. c. Comparison of experimental protein stability changes (PSC) with those predicted by CANet for the S350 dataset. d. Comparison of experimental PSC with those predicted by CANet for the S2648 dataset. e. Performance of CANet and CATree for the S2648 dataset compared to existing state-of-the-art models [cang2017topologynet, worth2011sdm, quan2016strum]. f. Performance of CANet and CATree for the S350 dataset compared to existing state-of-the-art models [cang2017topologynet, worth2011sdm, quan2016strum]. g. Accuracy scores of CANet and CATree for mutation-induced protein solubility change classification compared with existing state-of-the-art models [wee2024integration, yang2021pon]. Dark blue bars represent the accuracy scores and light blue bars represent the corresponding normalized accuracies.
  • Figure 3: Electrostatic interaction analysis and mutation impact on protein structure and pathogenicity [sun2022electrostatics]. a. Structural shift of the A38E mutation in the membrane protein human aquaporin 5 (PDB ID: 3D9S) from the surface to the interior region, i.e. [Sur, Int]. b. The number of pathogenic and benign samples in the M546 dataset, broken down by region-region pair. c. Balanced accuracy of CATree's M546 prediction stratified by four mutation region combinations. d. Results of CATree's prediction grouped by amino acid type, showing varying impact on the balanced accuracy score. Bold numbers indicate sample counts per cell. e. Persistent facet ideals reveal hydrogen bonding interactions prior to the disruption caused by the D614G mutation in the SARS-CoV-2 spike protein (PDB ID: 6VSB) [wrapp2020cryo]. Further illustration of the mutation region is depicted in Supplementary Fig. S8. The hydrogen bond is represented by the appearance of the green dimension-1 facet ideal after two green dimension-0 facet ideals stop persisting at 2.74Å.
  • Figure 4: Figure 3 (cont'd): f. Persistent facet ideals illustrate salt bridge formation in the amyloid fibril structure (PDB ID: 7DWV) [wang2021genetic] following mutation E196K linked to genetic prion disease. Dimension-0 facet ideals persist up to 3.35Å and 3.40Å (in green), representing two N–O atom pairs. The emergence of dimension-1 ideals at these distances (in green) marks the formation of a salt bridge, reflecting new electrostatic interactions introduced by lysine.
  • Figure 5: Illustrations of multiscale commutative algebra analysis on point-cloud data using a Rips complex-based filtration process: a. Facet persistence barcode for 6 points. b. Facet persistence barcode for a cuboid with dimensions $1 \times 1 \times 1.5$. c. Facet persistence barcode for the $C_{\alpha}$ atoms of protein 1C26 with alpha-helix structures. d. Facet persistence barcode for the $C_{\alpha}$ atoms of protein 2JOX with beta-sheet structures. e. $f$-vector curves for the $C_{\alpha}$ atoms of protein 2GR8. f. $f$-vector curves for the atoms in DNA structure 1BNA.
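
The Rips filtration behind Figure 5 can be illustrated with a minimal sketch. It is not the authors' implementation and does not compute persistent facet ideals; it only builds the Vietoris-Rips complex of the $1 \times 1 \times 1.5$ cuboid from panel b at a few scales and reports the $f$-vector (number of simplices in each dimension) and the facets (maximal simplices). The function names, the cutoff at dimension 3, and the convention that a simplex enters when all pairwise distances are at most $r$ are assumptions made for this example.

```python
from itertools import combinations, product
from math import dist

# Vertices of a 1 x 1 x 1.5 cuboid (the shape named in Figure 5b).
CUBOID = [(x, y, z) for x, y, z in product((0.0, 1.0), (0.0, 1.0), (0.0, 1.5))]


def rips_simplices(points, r, max_dim=3):
    """Simplices of the Vietoris-Rips complex at scale r, up to max_dim.

    Assumed convention: a vertex set spans a simplex when every pairwise
    distance is <= r. Brute force, so only suitable for tiny point clouds.
    """
    n = len(points)
    close = [[dist(points[i], points[j]) <= r for j in range(n)] for i in range(n)]
    simplices = []
    for size in range(1, max_dim + 2):  # a k-simplex has k + 1 vertices
        for combo in combinations(range(n), size):
            if all(close[i][j] for i, j in combinations(combo, 2)):
                simplices.append(combo)
    return simplices


def f_vector(simplices, max_dim=3):
    """Count simplices per dimension: [f_0, f_1, ..., f_max_dim]."""
    counts = [0] * (max_dim + 1)
    for s in simplices:
        counts[len(s) - 1] += 1
    return counts


def facets(simplices):
    """Simplices that are not a proper face of another simplex in the list."""
    sets = [frozenset(s) for s in simplices]
    return [s for s in sets if not any(s < t for t in sets)]


if __name__ == "__main__":
    # Sampling the f-vector over increasing r gives a discrete f-vector curve.
    for r in (0.5, 1.1, 1.45, 1.6, 2.1):
        cx = rips_simplices(CUBOID, r)
        facet_dims = sorted(len(f) - 1 for f in facets(cx))
        print(f"r={r}: f-vector={f_vector(cx)}, facet dimensions={facet_dims}")
```

At small scales every vertex is its own facet; once the eight unit-length edges appear, the facets are those edges, and when the square faces fill in, the facets become the two tetrahedra spanned by the bottom and top faces. Tracking when facets of each dimension appear and disappear as $r$ grows is the kind of information a facet persistence barcode summarizes.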