A Unified Approach to Inferring Chemical Compounds with the Desired Aqueous Solubility

Muniba Batool, Naveed Ahmed Azam, Jianshen Zhu, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu

Abstract

Aqueous solubility (AS) is a key physicochemical property that plays a crucial role in drug discovery and material design. We report a novel unified approach to predicting AS and inferring chemical compounds with a desired AS, based on simple deterministic graph-theoretic descriptors, multiple linear regression (MLR), and mixed integer linear programming (MILP). Descriptors selected by a forward stepwise procedure enabled the simplest regression model, MLR, to achieve good prediction accuracy compared to existing approaches, with accuracy in the range [0.7191, 0.9377] across 29 diverse datasets. By simulating these descriptors and learning models as MILPs, we inferred mathematically exact and optimal compounds with the desired AS, prescribed structures, and up to 50 non-hydrogen atoms in a reasonable time, ranging from 6 to 1204 seconds. These findings indicate a strong correlation between the simple graph-theoretic descriptors and the AS of compounds, potentially leading to a deeper understanding of AS without relying on the widely used complicated chemical descriptors and complex machine learning models, which are computationally expensive and therefore difficult to use for inference. An implementation of the proposed approach is available at https://github.com/ku-dml/mol-infer/tree/master/AqSol.
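The forward stepwise selection followed by MLR that the abstract mentions can be illustrated with a minimal sketch. This is not the authors' implementation: the selection criterion (training R^2), the function names `r2_of_fit` and `forward_stepwise`, and the synthetic data are all assumptions made for illustration, and the abstract does not state which accuracy measure or stopping rule the paper uses.

```python
import numpy as np

def r2_of_fit(X, y):
    """Fit ordinary least squares with an intercept; return R^2 on the same data."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares MLR coefficients
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def forward_stepwise(X, y, k):
    """Greedily pick k column indices of X, each step adding the column
    that gives the largest R^2 (assumed criterion, for illustration only)."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best_j = max(remaining, key=lambda j: r2_of_fit(X[:, selected + [j]], y))
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Hypothetical synthetic data: y depends only on descriptor columns 1 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + 0.01 * rng.normal(size=200)

chosen = forward_stepwise(X, y, 2)
final_r2 = r2_of_fit(X[:, chosen], y)
```

On this toy data the procedure should recover the two informative columns. In practice the accuracy would be estimated on held-out data (for example by cross-validation) rather than on the training set as done here.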

Paper Structure (11 sections, 3 equations, 5 figures, 8 tables, 1 algorithm)

Figures (5)

  • Figure 1: An illustration of our approach to inferring a chemical graph with the desired AS.
  • Figure 2: (a) Representation of the chemical compound 3-(3-Ethylcyclopentyl) propanoic acid with CID = 20849290 as a chemical graph $\mathbb{C}$; (b) The vertices and edges of the interior and exterior parts of $\mathbb{C}$ depicted with black and gray colors, respectively, in the two-layered model. The sets of interior and exterior vertices are $\{u_1, u_2, \ldots, u_7\}$ and $\{w_1, w_2, \ldots, w_5\}$, respectively.
  • Figure 3: The 2-fringe trees $\mathbb{C}[u_i]$, $i \in [1,7]$, of the example $\mathbb{C}$ in Figure 2(a), rooted at $u_i$.
  • Figure 4: (a) An illustration of a seed graph $G_\mathbb{C}$ for the chemical graph given in Figure 2(a), with a typical edge depicted by a dashed line; (b) A set $\mathcal{F} = \{\psi_1, \psi_2, \ldots, \psi_9\}$ of chemical rooted trees, where hydrogen atoms attached to non-root vertices are omitted.
  • Figure 5: Chemical graphs inferred using the datasets Jain, Duffy, Wang, and Phys, shown in (i)-(vii), (viii)-(xiv), (xv)-(xxi), and (xxii)-(xxviii), respectively.