Empowering Machines to Think Like Chemists: Unveiling Molecular Structure-Polarity Relationships with Hierarchical Symbolic Regression

Siyu Lou; Chengchun Liu; Yuntian Chen; Fanyang Mo

TL;DR

This work introduces Unsupervised Hierarchical Symbolic Regression (UHiSR), which combines hierarchical neural networks with symbolic regression to automatically distill chemically intuitive polarity indices and discover interpretable equations linking molecular structure to chromatographic behavior.

Abstract

Thin-layer chromatography (TLC) is a crucial technique in molecular polarity analysis. Despite its importance, the interpretability of predictive models for TLC, especially those driven by artificial intelligence, remains a challenge. Current approaches, which rely on either high-dimensional molecular fingerprints or domain-knowledge-driven feature engineering, often face a trade-off between expressiveness and interpretability. To bridge this gap, we introduce Unsupervised Hierarchical Symbolic Regression (UHiSR), which combines hierarchical neural networks with symbolic regression. UHiSR automatically distills chemically intuitive polarity indices and discovers interpretable equations that link molecular structure to chromatographic behavior.

Paper Structure (29 sections, 6 equations, 6 figures, 17 tables)

Introduction
Results
Discussion
Materials and Methods
Acknowledgments
Supplementary Materials

Figures (6)

Figure 1: Overview of Unsupervised Hierarchical Symbolic Regression (UHiSR). (A) Illustration of the TLC experiment and the calculation of the retardation factor ($R_f$). (B) Feature engineering, involving five solvent features based on volume percentages and the decomposition of target molecules into functional groups (FG). The molecular structure is treated as a composite formed by stacking various functional group modules. (C) UHiSR framework with three main stages: chemist-guided feature clustering, a hierarchical neural network for latent variable extraction (e.g., solute polarity index), and symbolic regression for discovering explicit equations between the target value and latent variables.
Figure 2: Comparative analysis of different molecular feature sets. The fitting accuracy is evaluated using the XGBoost model across three feature groups: (A) features introduced in this paper, detailed in Table \ref{tab:feature}; (B) MACCS keys; (C) physicochemical descriptors.
Figure 3: Illustration of the polarity indices and their impact on chromatographic behavior. (A) The input features correspond to different polarity indices. Here, FG stands for functional group. (B) The TLC experiment can be understood as a process in which the stationary phase (silica gel) and the mobile phase (solvent) compete for the solute molecules. The outcome of this competition is reflected in the $R_f$ value. The factors influencing the $R_f$ value fall into three types: interaction between the stationary phase and the solvent (I), interaction between the stationary phase and the solute (II), and interaction between the solvent and the solute (III). The two polarity indices ($\Psi$ and $\xi$) characterize how the solvent and the solute, respectively, affect the chromatographic behavior. (C) Illustration of interactions between the stationary phase and different functional groups.
Figure 4: Visualization of the latent variables and the decomposition of the retrieved formula. (A) Visualization of the observed and calculated $R_f$ values with two polarity indices $\Psi$ and $\xi$. (B) Fitting observed $R_f$ values with a sigmoid function. (C) Decomposition of Equation (1) into $h(\Psi)$ and $g(\xi)$. (D) Decomposition of Equation (3) into $f_1(\alpha)$ and $f_2(\beta)$. (E) FG distribution polarity index $\alpha$ of 10 example compounds.
Figure 5: Hierarchical structure of learning latent variables. Three stages are organized from top to bottom. (A) At the first stage, two latent variables, $\Psi$ and $\xi$, are generated to encapsulate the overall polarity of the solvent and the solute, respectively. (B) At the second stage, two additional latent variables, $\alpha$ and $\beta$, are learned to assess the solute molecule's polarity from two distinct perspectives. $\alpha$ is linked to the distribution of functional groups, while $\beta$ pertains to the quantity of individual functional groups. (C) The third stage characterizes five latent variables, each representing the impact of specific groups of functional groups.
...and 1 more figure
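
The captions for Figure 1(A) and Figure 4(B) mention the calculation of $R_f$ and a sigmoid fit without showing either. The sketch below is illustrative only. It uses the standard definition of the retardation factor (distance travelled by the solute spot divided by distance travelled by the solvent front) and the generic logistic function, which is a natural candidate for a quantity bounded between 0 and 1. The function names and the plate measurements are hypothetical, and the logistic form shown here is not the equation retrieved in the paper.

```python
import math


def retardation_factor(solute_distance: float, solvent_front_distance: float) -> float:
    """Standard TLC definition: solute travel distance / solvent front travel distance."""
    if solvent_front_distance <= 0:
        raise ValueError("solvent front distance must be positive")
    if not 0 <= solute_distance <= solvent_front_distance:
        raise ValueError("solute distance must lie between 0 and the solvent front")
    return solute_distance / solvent_front_distance


def sigmoid(x: float) -> float:
    """Generic logistic function (assumed form, not the paper's equation).

    Written in a numerically stable way so large negative inputs do not overflow.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Hypothetical plate measurements in centimetres (illustrative, not data from the paper).
rf = retardation_factor(solute_distance=2.0, solvent_front_distance=5.0)
print(rf)            # 0.4
print(sigmoid(0.0))  # 0.5
```

Because $R_f$ is a ratio of distances on the same plate, it is dimensionless and always lies between 0 and 1, which is why a bounded, monotonic curve such as a sigmoid is a reasonable shape for relating it to a polarity score.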
