Predicting the Temperature Dependence of Surfactant CMCs Using Graph Neural Networks
Christoforos Brozos, Jan G. Rittig, Sandip Bhattacharya, Elie Akanny, Christina Kohlmann, Alexander Mitsos
TL;DR
This work predicts the temperature dependence of surfactant CMCs across ionic, nonionic, zwitterionic, and sugar-based classes by developing an end-to-end GNN that incorporates temperature into the molecular fingerprint. An ensemble of GINEConv-based GNNs is trained on a dataset of $1{,}377$ CMC measurements from $492$ unique surfactants spanning $0^\circ$C to $90^\circ$C, with two evaluation schemes to test temperature extrapolation and generalization to unseen structures. The model achieves high predictive performance, with $R^2$ around $0.97$ for the different-temperature split and $0.94$ for the distinct-surfactant split, and RMSEs of about $0.173$ and $0.251$ (log CMC units), respectively, outperforming several prior temperature-independent approaches on larger, more diverse datasets. While the approach is robust, it underestimates some temperature sensitivities and exhibits class-dependent variability, especially for sugar-based surfactants. This motivates collecting more data and potentially using geometry-aware GNNs, as well as addressing pH effects and explainability in future work.
Abstract
The critical micelle concentration (CMC) of surfactant molecules is an essential property for surfactant applications in industry. Recently, classical QSPR models and Graph Neural Networks (GNNs), a deep learning technique, have been successfully applied to predict the CMC of surfactants at room temperature. However, these models have not yet considered the temperature dependence of the CMC, which is highly relevant for practical applications. We herein develop a GNN model for temperature-dependent CMC prediction of surfactants. We collect about 1400 data points from public sources for all surfactant classes, i.e., ionic, nonionic, and zwitterionic, at multiple temperatures. We test the predictive quality of the model for the following scenarios: i) CMC data for a surfactant are present in the training set at at least one different temperature, and ii) CMC data for a surfactant are not present in the training set, i.e., the model must generalize to unseen surfactants. In both test scenarios, our model exhibits a high predictive performance of $R^2 \geq 0.94$ on test data. We also find that the model performance varies by surfactant class. Finally, we evaluate the model for sugar-based surfactants with complex molecular structures, as these represent a more sustainable alternative to synthetic surfactants and are therefore of great interest for future applications in the personal and home care industries.
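The two test scenarios differ only in how measurements are assigned to the training and test sets, which a small sketch can make concrete. The code below is an illustration under stated assumptions, not the authors' splitting procedure: the record layout, the function names `different_temperature_split` and `distinct_surfactant_split`, the toy values, and the choice to hold out one measurement per surfactant are all hypothetical. It also assumes each surfactant has at most one measurement per temperature.

```python
import random
from collections import defaultdict

# Hypothetical toy records: (surfactant_id, temperature_C, log_cmc).
# All values are made up for illustration only.
RECORDS = [
    ("A", 20, -2.1), ("A", 40, -2.0), ("A", 60, -1.9),
    ("B", 25, -3.0), ("B", 45, -2.9),
    ("C", 25, -1.2),
    ("D", 10, -4.0), ("D", 30, -3.9),
]


def different_temperature_split(records, seed=0):
    """Scenario (i): every test surfactant also appears in training,
    but only at temperatures other than the held-out one."""
    rng = random.Random(seed)
    by_surfactant = defaultdict(list)
    for rec in records:
        by_surfactant[rec[0]].append(rec)

    train, test = [], []
    for sid in sorted(by_surfactant):
        recs = by_surfactant[sid]
        if len(recs) < 2:
            # A surfactant measured at a single temperature cannot be
            # held out here without vanishing from training entirely.
            train.extend(recs)
            continue
        held_out = rng.choice(recs)
        test.append(held_out)
        train.extend(r for r in recs if r is not held_out)
    return train, test


def distinct_surfactant_split(records, test_fraction=0.25, seed=0):
    """Scenario (ii): all measurements of a test surfactant are removed
    from training, so the model sees only unseen structures at test time."""
    rng = random.Random(seed)
    ids = sorted({r[0] for r in records})
    rng.shuffle(ids)
    n_test = max(1, round(test_fraction * len(ids)))
    test_ids = set(ids[:n_test])
    train = [r for r in records if r[0] not in test_ids]
    test = [r for r in records if r[0] in test_ids]
    return train, test


if __name__ == "__main__":
    for name, split in [("different-temperature", different_temperature_split),
                        ("distinct-surfactant", distinct_surfactant_split)]:
        train, test = split(RECORDS)
        print(name, "train:", len(train), "test:", len(test))
```

Grouping by surfactant identifier before splitting is what separates the two scenarios: in the first, groups are divided between training and test; in the second, each group is kept whole on one side.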
