Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm

Sarwan Ali, Haris Mansoor, Prakash Chourasia, Imdad Ullah Khan, Murray Patterson

TL;DR

This work addresses encoding SMILES strings for supervised molecular tasks by constructing an optimal-transport–driven kernel. It converts SMILES to molecular graphs, derives 2048-bit Morgan fingerprints, builds a Gaussian pairwise distance matrix $\mathbf{D}$ with width $\sigma$, and obtains a final kernel $\mathbf{K}$ through Sinkhorn-Knopp updates that enforce marginals via $\mathbf{a}$ and $\mathbf{b}$ with $\mathbf{K} = \operatorname{diag}(\mathbf{a}) \mathbf{P} \operatorname{diag}(\mathbf{b})$. Kernel PCA projects $\mathbf{K}$ into a low-dimensional embedding used for drug-subcategory classification and solubility regression, enabling improved class separation and competitive predictive performance. Empirical results on DrugBank and ChEMBL show the SMILES kernel often surpasses baseline embeddings in classification while delivering competitive regression results, with inter-class heatmaps indicating clearer separation than Morgan fingerprints. Overall, the paper presents a promising kernel-based framework for molecular representation that can aid drug discovery and molecular design by capturing intrinsic structural information through optimal-transport–based relationships.

Abstract

In molecular structure data, SMILES (Simplified Molecular Input Line Entry System) strings are used to analyze molecular structure design. Numerical feature representation of SMILES strings is a challenging task. This work proposes a kernel-based approach for encoding and analyzing molecular structures from SMILES strings. The proposed approach computes a kernel matrix using the Sinkhorn-Knopp algorithm and applies kernel principal component analysis (PCA) for dimensionality reduction. The resulting low-dimensional embeddings are then used for classification and regression analysis. To compute the kernel matrix, the SMILES strings are first converted into molecular structures, and a Morgan fingerprint is computed for each molecule. The distance matrix is computed using the pairwise kernels function. The Sinkhorn-Knopp algorithm is then used to compute the final kernel matrix, which satisfies the constraints of a probability distribution. This is achieved by iteratively adjusting the kernel matrix until the marginal distributions of its rows and columns match the desired marginal distributions. We provide a comprehensive empirical analysis of the proposed kernel method. The method is assessed on drug subcategory prediction (classification task) and on solubility AlogPS, "Aqueous solubility and Octanol/Water partition coefficient" (regression task), using the benchmark SMILES string dataset. The results show that the proposed method outperforms several baseline methods in supervised analysis and has potential uses in molecular design and drug discovery. Overall, the method is a promising avenue for kernel-based molecular structure analysis and design.
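The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several things here are assumptions made for illustration: random binary vectors stand in for molecular fingerprints (in practice these would come from a cheminformatics toolkit), the Gaussian kernel is taken as $\exp(-d^2 / (2\sigma^2))$ on Euclidean distances, the target row and column marginals are uniform (all ones), and the function names `gaussian_kernel`, `sinkhorn_knopp`, and `kernel_pca` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for fingerprints: 20 random 64-bit vectors (illustrative assumption)
X = rng.integers(0, 2, size=(20, 64)).astype(float)

def gaussian_kernel(X, sigma):
    """Pairwise Gaussian kernel exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative values from rounding
    return np.exp(-d2 / (2.0 * sigma ** 2))

def sinkhorn_knopp(P, n_iter=1000, tol=1e-9):
    """Rescale a positive matrix P to diag(a) P diag(b) with unit row/column sums."""
    n = P.shape[0]
    a = np.ones(n)
    b = np.ones(n)
    for _ in range(n_iter):
        a = 1.0 / (P @ b)    # makes every row sum equal to 1
        b = 1.0 / (P.T @ a)  # makes every column sum equal to 1
        row_sums = a * (P @ b)
        if np.max(np.abs(row_sums - 1.0)) < tol:
            break
    return a[:, None] * P * b[None, :]

def kernel_pca(K, k):
    """Project a symmetric kernel matrix onto its top-k centered components."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    vals, vecs = np.linalg.eigh(H @ K @ H)
    idx = np.argsort(vals)[::-1][:k]     # indices of the k largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

P = gaussian_kernel(X, sigma=4.0)
K = sinkhorn_knopp(P)
K = 0.5 * (K + K.T)    # remove residual asymmetry left by finite iterations
Z = kernel_pca(K, k=5) # low-dimensional embedding, one row per molecule
```

The rows of `Z` could then be passed to any standard classifier or regressor. The width `sigma=4.0` and the embedding size `k=5` are arbitrary choices for this toy input, not values taken from the paper.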

Paper Structure

This paper contains 18 sections, 4 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Workflow of the proposed method.
  • Figure 2: The t-SNE plots using feature embedding for the DrugBank dataset.
  • Figure 3: Heatmap for classes in DrugBank dataset for different drug subtypes. The figure is best seen in color.
  • Figure 4: Comparing two pairs of classes. (a) and (b) belong to different classes. On the DrugBank dataset, the Gaussian kernel value for (a) and (b) is 0.21, while the proposed method gives 0.17 (a smaller value is better). The bar plot uses kernel PCA with k=100 (x-axis) and shows the respective values (y-axis).