Thermodynamically consistent machine learning model for excess Gibbs energy

Marco Hoffmann, Thomas Specht, Quirin Göttl, Jakob Burger, Stephan Mandt, Hans Hasse, Fabian Jirasek

TL;DR

HANNA predicts the excess Gibbs energy $g^\mathrm{E}$ of multi-component liquid mixtures from molecular structure alone while enforcing thermodynamic constraints. It combines transformer-based molecular embeddings with a hard-constraint neural network and a geometric Muggianu projection, which extends predictions for the binary subsystems to mixtures with an arbitrary number of components. A differentiable surrogate solver enables end-to-end training on LLE data. The model is trained on extensive binary data sets (VLE TPXY, TPX, ACI, HE, LLE) from the Dortmund Data Bank and is reported to be more accurate than state-of-the-art UNIFAC-based models for binary and ternary systems, including ionic liquids, while covering a broader domain of applicability. The authors release the full model and code openly and provide an interactive interface to ease integration into process design and materials discovery.

Abstract

The excess Gibbs energy plays a central role in chemical engineering and chemistry, providing a basis for modeling thermodynamic properties of liquid mixtures. Predicting the excess Gibbs energy of multi-component mixtures solely from molecular structures is a long-standing challenge. We address this challenge with HANNA, a flexible machine learning model for excess Gibbs energy that integrates physical laws as hard constraints, guaranteeing thermodynamically consistent predictions. HANNA is trained on experimental data for vapor-liquid equilibria, liquid-liquid equilibria, activity coefficients at infinite dilution and excess enthalpies in binary mixtures. The end-to-end training on liquid-liquid equilibrium data is facilitated by a surrogate solver. A geometric projection method enables robust extrapolations to multi-component mixtures. We demonstrate that HANNA delivers accurate predictions, while providing a substantially broader domain of applicability than state-of-the-art benchmark methods. The trained model and corresponding code are openly available, and an interactive interface is provided on our website, MLPROP.
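
For orientation, the relations below are the standard textbook links between $g^\mathrm{E}$ and the quantities named in the abstract (activity coefficients, excess enthalpy, and the Gibbs energy of mixing). They are not taken from the paper itself, and this excerpt does not state which constraints HANNA enforces or how. The symbols $n$ (total mole number), $n_i$ (mole number of component $i$), $p$ (pressure), and $R$ (gas constant) are introduced here for illustration.

```latex
\begin{align}
% Activity coefficient as partial molar excess Gibbs energy
\ln\gamma_i &= \left[\frac{\partial\,(n\,g^\mathrm{E}/RT)}{\partial n_i}\right]_{T,p,n_{j\neq i}} \\
% Summation relation (Euler's theorem)
\frac{g^\mathrm{E}}{RT} &= \sum_i x_i \ln\gamma_i \\
% Gibbs-Duhem equation at constant T and p
0 &= \sum_i x_i\,\mathrm{d}\ln\gamma_i \\
% Gibbs-Helmholtz equation for the excess enthalpy
h^\mathrm{E} &= -T^2 \left[\frac{\partial\,(g^\mathrm{E}/T)}{\partial T}\right]_{p,\boldsymbol{x}} \\
% Gibbs energy of mixing: ideal part plus excess part
\frac{\Delta g_\mathrm{mix}}{RT} &= \sum_i x_i \ln x_i + \frac{g^\mathrm{E}}{RT}
\end{align}
```

Activity coefficients obtained by differentiating a single $g^\mathrm{E}$ function satisfy the summation relation and the Gibbs-Duhem equation by construction, which is one common meaning of "thermodynamically consistent".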

Paper Structure

This paper contains 9 sections, 25 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the HANNA prediction framework and its training procedure. a, As input, HANNA requires the molecular structure of all components in SMILES notation [Weininger1988], the composition of the mixture $\boldsymbol{x}$, and the temperature $T$. In the first step, molecular embeddings are generated from the SMILES using the transformer model ChemBERTa-2 [Ahmad2022]. Subsequently, a hard-constraint neural network that strictly obeys all thermodynamic constraints predicts $g^\mathrm{E}_{ij}$ of all binary subsystems, from which $g^\mathrm{E}$ in the multi-component mixture is calculated via a geometric projection [Muggianu1975]. Thermodynamically consistent activity coefficients $\gamma_i$ are obtained by automatic differentiation of $g^\mathrm{E}$ and then used to predict phase equilibria using the convex envelope method (CEM) [Ryll2012, Goettl2023, Goettl2025]. b, HANNA is trained on experimental vapor-liquid equilibrium (VLE), liquid-liquid equilibrium (LLE), activity coefficients at infinite dilution (ACI), and excess enthalpy (HE) data of binary mixtures. End-to-end training on the experimental LLE data is enabled by a new surrogate solver using the Gibbs energy of mixing $\Delta g_\mathrm{mix}$. An additional Gibbs loss incentivizes the model to produce negative second derivatives of $\Delta g_\mathrm{mix}$ in the LLE regime, and a Lipschitz regularization enforces smoother $g^\mathrm{E}$ functions.
  • Figure 2: Predictions for binary mixtures with HANNA. a, Boxplots comparing the performance of HANNA for predicting activity coefficients in binary VLE (top), ACI (middle), and LLE phase compositions (bottom) with that of mod. UNIFAC in terms of the system-wise mean absolute error $\mathrm{MAE}_\mathrm{sys}$ in $\ln\gamma_i$ (VLE, ACI) or phase compositions $x_i^\prime$ and $x_i^{\prime\prime}$ (LLE). For a fair comparison, HANNA was also evaluated only on those systems for which mod. UNIFAC is applicable (mod. UNIFAC horizon). $N_\mathrm{sys}$ denotes the number of test systems within the respective horizon and data type. The boxes represent interquartile ranges, and the whiskers extend to 1.5 times the interquartile range. Diamonds mark the mean, horizontal lines the median of the $\mathrm{MAE}_\mathrm{sys}$ values. b, Predictions of HANNA for three isothermal VLE plotted against experimental data. Open symbols denote experimental data, and lines are predictions by HANNA. Blue indicates the liquid phase; red indicates the vapor phase. The molecular structures of the components and the respective $\mathrm{MAE}_\mathrm{sys}$ are depicted in the plots; the left molecule corresponds to component 1. c, Predictions of HANNA for three LLE. Open squares indicate experimental data, and lines represent predictions by HANNA. Predicted upper and lower critical solution temperatures are marked with an 'x'. d, Predictions of HANNA for three isobaric heteroazeotropes. Open blue and red circles indicate the experimental liquid and vapor phases, respectively, from VLE data. Open blue squares mark the experimental phase compositions from LLE data; the green line is the vapor-liquid-liquid equilibrium predicted by HANNA. All results shown here are for systems from the test set that were not used for HANNA training.
  • Figure 3: Predictions for ternary mixtures with HANNA. a, Boxplots comparing the performance of HANNA for predicting activity coefficients in ternary VLE (top), ACI (middle), and LLE phase compositions (bottom) with that of mod. UNIFAC in terms of the system-wise mean absolute error $\mathrm{MAE}_\mathrm{sys}$ in $\ln\gamma_i$ (VLE, ACI) or phase compositions $x_i^\prime$ and $x_i^{\prime\prime}$ (LLE). For a fair comparison, HANNA was also evaluated only on those systems for which mod. UNIFAC is applicable (mod. UNIFAC horizon). $N_\mathrm{sys}$ denotes the number of test systems within the respective horizon and data type. The boxes represent interquartile ranges, and the whiskers extend to 1.5 times the interquartile range. Diamonds mark the mean, horizontal lines the median of the $\mathrm{MAE}_\mathrm{sys}$ values. b, Predictions of HANNA for two isothermal VLE. Open blue and red symbols mark the experimental liquid and vapor phase compositions, respectively; filled red symbols are the predicted vapor phase compositions. The molecular structures of the components and the respective $\mathrm{MAE}_\mathrm{sys}$ are also depicted. c, Predictions of HANNA for three isothermal LLE. Open symbols and dashed tie lines represent the experimental data; filled symbols and solid lines represent predictions by HANNA. All results shown here are for systems from the test set that were not used for HANNA training.
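
The Figure 1 caption describes two steps that a small example may make more concrete: combining binary $g^\mathrm{E}_{ij}$ predictions into a multi-component $g^\mathrm{E}$ by a Muggianu projection, and obtaining $\ln\gamma_i$ by differentiating $g^\mathrm{E}$. The sketch below is not the authors' code. It assumes the textbook form of the Muggianu formula, replaces HANNA's neural network with a hypothetical two-parameter Margules model (`margules_binary`, with made-up parameters), and uses central finite differences as a dependency-free stand-in for the automatic differentiation mentioned in the caption. All function and variable names are illustrative.

```python
from itertools import combinations


def margules_binary(x_i, x_j, a_ij, a_ji):
    """Toy binary g^E/(RT); a hypothetical stand-in for HANNA's network."""
    return x_i * x_j * (a_ji * x_i + a_ij * x_j)


def muggianu_gE(x, params):
    """g^E/(RT) of a mixture from its binary subsystems (textbook Muggianu form).

    Each pair (i, j) is evaluated at the projected binary composition
    X_i = (1 + x_i - x_j) / 2, X_j = (1 + x_j - x_i) / 2 and weighted
    by x_i * x_j / (X_i * X_j).
    """
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        if x[i] == 0.0 or x[j] == 0.0:
            continue  # pair contributes nothing; also avoids 0/0 at a pure component
        X_i = 0.5 * (1.0 + x[i] - x[j])
        X_j = 0.5 * (1.0 + x[j] - x[i])
        a_ij, a_ji = params[(i, j)]
        weight = x[i] * x[j] / (X_i * X_j)
        total += weight * margules_binary(X_i, X_j, a_ij, a_ji)
    return total


def ln_gamma(x, params, h=1e-6):
    """ln(gamma_i) = d(n * g^E/(RT)) / d n_i, here by central finite differences.

    Finite differences replace automatic differentiation only to keep the
    sketch free of third-party dependencies.
    """
    def extensive_gE(n):
        n_tot = sum(n)
        return n_tot * muggianu_gE([n_k / n_tot for n_k in n], params)

    result = []
    for k in range(len(x)):
        n_plus = list(x)
        n_minus = list(x)
        n_plus[k] += h
        n_minus[k] -= h
        result.append((extensive_gE(n_plus) - extensive_gE(n_minus)) / (2.0 * h))
    return result


# Made-up Margules parameters (a_ij, a_ji) for each pair i < j of a ternary mixture.
params = {(0, 1): (1.0, 2.0), (0, 2): (0.5, 0.8), (1, 2): (1.5, 1.2)}

x = [0.2, 0.3, 0.5]
gE_RT = muggianu_gE(x, params)
ln_gammas = ln_gamma(x, params)

# Consistency check: sum_i x_i * ln(gamma_i) should reproduce g^E/(RT).
residual = sum(x_i * lg for x_i, lg in zip(x, ln_gammas)) - gE_RT
print("g^E/(RT):", gE_RT)
print("ln gamma:", ln_gammas)
print("summation-relation residual:", residual)
```

Because every $\ln\gamma_i$ comes from the same scalar function, the summation relation $\sum_i x_i \ln\gamma_i = g^\mathrm{E}/(RT)$ holds up to the finite-difference error. When one mole fraction is zero, the projection reduces exactly to the remaining binary model.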