Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm
Sarwan Ali, Haris Mansoor, Prakash Chourasia, Imdad Ullah Khan, Murray Patterson
TL;DR
This work addresses encoding SMILES strings for supervised molecular tasks by constructing an optimal-transport–driven kernel. It converts SMILES to molecular graphs, derives 2048-bit Morgan fingerprints, builds a Gaussian pairwise distance matrix $\mathbf{D}$ with width $\sigma$, and obtains a final kernel $\mathbf{K}$ through Sinkhorn-Knopp updates that enforce marginals via $\mathbf{a}$ and $\mathbf{b}$ with $\mathbf{K} = \operatorname{diag}(\mathbf{a}) \mathbf{P} \operatorname{diag}(\mathbf{b})$. Kernel PCA projects $\mathbf{K}$ into a low-dimensional embedding used for drug-subcategory classification and solubility regression, enabling improved class separation and competitive predictive performance. Empirical results on DrugBank and ChEMBL show the SMILES kernel often surpasses baseline embeddings in classification while delivering competitive regression results, with inter-class heatmaps indicating clearer separation than Morgan fingerprints. Overall, the paper presents a promising kernel-based framework for molecular representation that can aid drug discovery and molecular design by capturing intrinsic structural information through optimal-transport–based relationships.
Abstract
SMILES (Simplified Molecular Input Line Entry System) strings are widely used to represent molecular structures, but deriving numerical feature representations from them is challenging. This work proposes a kernel-based approach for encoding and analyzing molecular structures from SMILES strings. The approach computes a kernel matrix with the Sinkhorn-Knopp algorithm and applies kernel principal component analysis (kernel PCA) for dimensionality reduction. The resulting low-dimensional embeddings are then used for classification and regression analysis. To build the kernel matrix, each SMILES string is first converted into a molecular structure, from which a Morgan fingerprint is computed. The distance matrix is computed from these fingerprints using the pairwise kernels function. The Sinkhorn-Knopp algorithm then produces the final kernel matrix, which satisfies the constraints of a probability distribution: the matrix is rescaled iteratively until its row and column marginals match the desired marginal distributions. We provide a comprehensive empirical analysis of the proposed kernel method. The method is evaluated on drug subcategory prediction (a classification task) and on prediction of AlogPS solubility, "Aqueous solubility and Octanol/Water partition coefficient" (a regression task), using a benchmark SMILES string dataset. The results show that the proposed method outperforms several baseline methods in supervised analysis and has potential uses in molecular design and drug discovery. Overall, the proposed method is a promising avenue for kernel-based molecular structure analysis and design.
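The pipeline described above (fingerprints, Gaussian pairwise matrix, Sinkhorn-Knopp scaling, kernel PCA) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `gaussian_kernel`, `sinkhorn_knopp`, and `kernel_pca` are hypothetical, the target marginals are assumed to be uniform, the width `sigma=4.0` is arbitrary, and random 64-bit vectors stand in for the Morgan fingerprints that would in practice be computed from SMILES strings with a cheminformatics toolkit.

```python
import numpy as np


def gaussian_kernel(X, sigma=1.0):
    """Gaussian similarity exp(-||x_i - x_j||^2 / (2 sigma^2)) for all row pairs."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative values from round-off
    return np.exp(-d2 / (2.0 * sigma ** 2))


def sinkhorn_knopp(D, n_iter=1000, tol=1e-9):
    """Rescale a strictly positive matrix D to diag(a) D diag(b).

    Assumption: both target marginals are uniform (1/n per row and column).
    """
    n = D.shape[0]
    r = np.full(n, 1.0 / n)  # target row sums
    c = np.full(n, 1.0 / n)  # target column sums
    a = np.ones(n)
    b = np.ones(n)
    for _ in range(n_iter):
        a = r / (D @ b)      # fix row sums given current b
        b = c / (D.T @ a)    # fix column sums given updated a
        K = a[:, None] * D * b[None, :]
        if np.abs(K.sum(axis=1) - r).max() < tol:
            break
    return a[:, None] * D * b[None, :]


def kernel_pca(K, n_components=2):
    """Embed points from a precomputed kernel via the centered eigendecomposition."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)  # centering matrix
    Kc = H @ K @ H
    Kc = (Kc + Kc.T) / 2.0                    # enforce exact symmetry
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    top = np.maximum(vals[idx], 0.0)          # clip small negative eigenvalues
    return vecs[:, idx] * np.sqrt(top)


# Stand-in data: 6 random 64-bit vectors in place of real Morgan fingerprints.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(6, 64)).astype(float)

D = gaussian_kernel(X, sigma=4.0)  # Gaussian pairwise matrix
K = sinkhorn_knopp(D)              # rows and columns each sum to 1/6
Z = kernel_pca(K, n_components=2)  # low-dimensional embedding, shape (6, 2)
```

The rows of `Z` would then serve as input features for a downstream classifier or regressor. With real 2048-bit fingerprints, `sigma` would need to be tuned, since the Gaussian entries shrink quickly as the number of differing bits grows.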
