
Graph Neural Networks for Surfactant Multi-Property Prediction

Christoforos Brozos, Jan G. Rittig, Sandip Bhattacharya, Elie Akanny, Christina Kohlmann, Alexander Mitsos

TL;DR

This work tackles the challenge of predicting two key surfactant properties, $CMC$ and $\Gamma_{m}$, across diverse classes using Graph Neural Networks. It builds the largest public $CMC$ dataset (429 molecules) and the first substantial $\Gamma_{m}$ dataset (164 molecules), then systematically evaluates single-task and multi-task learning, ensemble methods, and transfer learning. A multi-task GNN with ensemble learning trained on all data achieves the best performance, delivering accurate $CMC$ predictions and notably improved $\Gamma_{m}$ predictions by leveraging their relationship. The approach generalizes to industrial, unpurified surfactants, highlighting strong potential for industrial formulation while acknowledging limitations related to impurities and lack of stereochemical information in the training data.

Abstract

Surfactants are of high importance in different industrial sectors such as cosmetics, detergents, oil recovery and drug delivery systems. Therefore, many quantitative structure-property relationship (QSPR) models have been developed for surfactants. Each predictive model typically focuses on one surfactant class, mostly nonionics. Graph Neural Networks (GNNs) have exhibited strong predictive performance for property prediction of ionic liquids, polymers and drugs in general. Specifically for surfactants, GNNs can successfully predict the critical micelle concentration (CMC), a key surfactant property associated with micellization. A key factor in the predictive ability of QSPR and GNN models is the data available for training. Based on an extensive literature search, we create the largest available CMC database with 429 molecules and the first large data collection for surface excess concentration ($\Gamma_{m}$), another surfactant property associated with foaming, with 164 molecules. Then, we develop GNN models to predict the CMC and $\Gamma_{m}$ and we explore different learning approaches, i.e., single- and multi-task learning, as well as different training strategies, namely ensemble and transfer learning. We find that a multi-task GNN with ensemble learning trained on all $\Gamma_{m}$ and CMC data performs best. Finally, we test the ability of our CMC model to generalize to industrial-grade pure component surfactants. The GNN yields highly accurate predictions for CMC, showing great potential for future industrial applications.
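Because the two databases differ in size (429 CMC values versus 164 $\Gamma_{m}$ values), training one multi-task model on all data means some molecules carry only one of the two labels. The following is a minimal sketch of one common way to handle this, a loss that skips missing labels, combined with simple averaging over ensemble members. The function names, the NaN convention for missing labels, the toy numbers, and the use of a squared-error loss are all assumptions for illustration; the abstract does not specify the authors' loss function or ensemble size.

```python
import numpy as np

def masked_mse(pred, target):
    """Mean squared error over the labels that are present (non-NaN)."""
    mask = ~np.isnan(target)
    # Replace missing targets by 0 before subtracting so no NaN enters the sum.
    diff = np.where(mask, pred - np.where(mask, target, 0.0), 0.0)
    return float((diff ** 2).sum() / mask.sum())

def ensemble_mean(member_preds):
    """Average predictions of several models; input shape (n_models, n_mols, n_tasks)."""
    return np.mean(np.asarray(member_preds, dtype=float), axis=0)

# Toy targets (made-up numbers): columns are [log CMC, Gamma_m]; NaN = not measured.
targets = np.array([[-3.0, 2.5],
                    [-2.0, np.nan],
                    [np.nan, 3.5]])

# Predictions from two hypothetical ensemble members.
preds_a = np.array([[-3.2, 2.7], [-2.1, 3.0], [-1.0, 3.4]])
preds_b = np.array([[-2.8, 2.3], [-1.9, 3.2], [-1.2, 3.6]])

avg = ensemble_mean([preds_a, preds_b])
loss_single = masked_mse(preds_a, targets)
loss_ensemble = masked_mse(avg, targets)
print(loss_single, loss_ensemble)
```

In this toy case the two members err in opposite directions, so the averaged prediction has a lower masked error than either member alone. Real ensembles reduce variance less cleanly than this.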

Paper Structure (18 sections, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Statistical overview of the CMC and $\Gamma_{m}$ databases, assembled from the literature in this work. A normally distributed data set would ensure an even train-validation-test split without artifacts.
  • Figure 2: Correlation plot between $\log$ CMC and $\Gamma_{m}$ for all surfactant classes. The plot shows the 141 surfactants for which both CMC and $\Gamma_{m}$ data were collected from the literature.
  • Figure 3: A surfactant molecule represented as an undirected graph. Every node (atom) is assigned a feature vector of size 30 (black-white) and every edge (bond) a feature vector of size 12 (green-white). The feature vectors encode chemical information about each individual atom and bond, respectively. As the atom feature vectors show, different atoms have different entries, which distinguishes them from one another. Bonds are likewise distinguished through their feature vectors.
  • Figure 4: Multi-task GNN ensemble models for (a) CMC and (b) $\Gamma_{m}$ for all surfactant classes. The light red dashed lines represent the 10% error and the dark red dashed lines the 20% error.
  • Figure 5: Outliers in multi-task learning for CMC prediction. Two of them, (a) and (b), belong to the nonionic class and the third, (c), to the anionic class.
  • ...and 2 more figures
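
The graph representation described in the Figure 3 caption (atoms as nodes with 30-entry feature vectors, bonds as edges with 12-entry feature vectors, undirected) can be sketched as follows. Only the two vector sizes come from the caption. The one-hot layout, the `ATOM_TYPES` and `BOND_TYPES` vocabularies, the `build_graph` helper, and the toy molecule are hypothetical, since the actual feature definitions are not given in this section.

```python
import numpy as np

ATOM_DIM, BOND_DIM = 30, 12  # vector sizes stated in the Figure 3 caption

# Hypothetical vocabularies; the real feature layout is not specified here.
ATOM_TYPES = ["C", "O", "N", "S"]
BOND_TYPES = ["single", "double"]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def build_graph(atoms, bonds):
    """Return node features, a (2, n_edges) edge index, and edge features."""
    x = np.stack([one_hot(ATOM_TYPES.index(a), ATOM_DIM) for a in atoms])
    edge_index, edge_attr = [], []
    for i, j, kind in bonds:
        feat = one_hot(BOND_TYPES.index(kind), BOND_DIM)
        # An undirected bond is stored as two directed edges with the same features.
        edge_index += [(i, j), (j, i)]
        edge_attr += [feat, feat]
    return x, np.array(edge_index).T, np.stack(edge_attr)

# Toy heavy-atom skeleton C-C-O, used only to show the shapes involved.
x, edge_index, edge_attr = build_graph(
    ["C", "C", "O"], [(0, 1, "single"), (1, 2, "single")]
)
print(x.shape, edge_index.shape, edge_attr.shape)
```

Storing each bond in both directions is a common convention in message-passing libraries. It lets information flow both ways along a bond while keeping the graph undirected in effect.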