Digital Surfactant

Sayeedul I. Sheikh, V. Subhasree Navya, Riya Sharma, Sudip Roy, Jayant K. Singh

TL;DR

The paper addresses the rapid design of non-ionic surfactants with tailored properties using two complementary data-driven approaches: a graph diffusion-based inverse design model (GraphDiT) and a transformer-based molecular optimization model. It leverages the SurfPro dataset and AttentiveFP predictors to generate and screen candidate surfactants, validating selected molecules with DFT and MD simulations to confirm property satisfaction. Diffusion-based generation offers broader chemical exploration and diversity, while transformer-based optimization yields tighter adherence to target properties, albeit with reduced space coverage. The combined workflow demonstrates that predicted properties align well with MD results, supporting the viability of AI-driven surfactant discovery and guiding future work toward broader chemistries and richer property sets.

Abstract

Surfactants play an important role in determining the cleaning performance and stability of detergents. However, the design of new surfactants using traditional methods is often time-consuming, complex, and largely based on trial and error. Recent studies have incorporated data-driven and computational approaches to generate new surfactants and predict their properties, but most of these approaches either optimize a single property or train on a small number of surfactants. In this work, we investigate the generative capabilities of an existing graph-diffusion-based inverse design model and a transformer-based molecule optimization model for non-ionic surfactants. We train both models to generate non-ionic surfactants based on single- and multi-property values, predict the same properties for the generated molecules using trained property predictor models, and validate a few of them using molecular dynamics simulations. Our results reveal that the inverse design model is better at generating a diverse set of molecules, while the transformer is better at generating molecules that satisfy the input property constraints. We also observe that molecules generated using a single property condition, on average, satisfy the input property condition better than molecules generated using multiple property conditions. From molecular dynamics simulations, we observe that the predicted properties of the selected molecules are close to the simulated results, from which we conclude that both methods are capable of generating surfactants that actually satisfy the input property conditions.

Paper Structure

This paper contains 18 sections, 12 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Distribution of generated molecules compared with the original training set, visualized by reducing the dimensionality of ChemBERTa embeddings. Panel (a) shows the generated molecules and the training set for the diffusion model, while panel (b) shows the same for molecules generated using the transformer model. Molecules generated using only $\text{pCMC}$ values are shown in red in both panels, and molecules generated using multi-property constraints are shown in blue. Grey marks the embeddings of molecules that belong to the training set.
  • Figure 2: Morgan fingerprints were computed for each generated molecule using RDKit's GetMorganFingerprintAsBitVect function with radius set to 2 and nBits set to 2048. These fingerprints were used to compute Tanimoto similarity scores between each pair of generated molecules for a particular method. The figure shows the distribution of these scores, with red indicating molecules generated using only $\text{pCMC}$ values and blue indicating molecules generated using all three property values.
  • Figure 3: Distribution of heavy atoms in the generated set for different methods with red representing the distribution for molecules generated using $\text{pCMC}$ values and blue representing the molecules generated using all three property values, compared to the training set shown in grey.
  • Figure 4: Distribution of unique bonds in the generated set for different methods with red representing the distribution for molecules generated using $\text{pCMC}$ values and blue representing the molecules generated using all three property values, compared to the training set shown in grey.
  • Figure 5: Comparison of predicted vs target $\text{pCMC}$ values for final screened molecules, using trained property predictor models. Red represents the molecules generated using only $\text{pCMC}$ values while
  • ...and 2 more figures
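
The pairwise Tanimoto comparison described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' code: it assumes each fingerprint is represented as a set of on-bit indices, and the function names (`tanimoto`, `pairwise_tanimoto`) and the toy fingerprints are hypothetical. In a real workflow the on-bit sets would come from the 2048-bit Morgan fingerprints produced by RDKit.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices.

    Defined as |A and B| / |A or B|: shared on-bits over all on-bits.
    """
    union = len(a | b)
    if union == 0:
        # Assumed convention for two empty fingerprints.
        return 0.0
    return len(a & b) / union


def pairwise_tanimoto(fingerprints):
    """Similarity score for every unordered pair of fingerprints."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


# Toy fingerprints: hypothetical on-bit indices within a 2048-bit vector.
toy_fps = [
    frozenset({1, 5, 9, 200}),
    frozenset({1, 5, 9, 1024}),
    frozenset({7, 300}),
]

scores = pairwise_tanimoto(toy_fps)
print(scores)  # [0.6, 0.0, 0.0]
```

A distribution of such scores concentrated near 0 would suggest a structurally diverse generated set, while scores near 1 would suggest many near-duplicate molecules.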