Hierarchical Matrix Completion for the Prediction of Properties of Binary Mixtures

Dominik Gond, Jan-Tobias Sohns, Heike Leitte, Hans Hasse, Fabian Jirasek

TL;DR

This work shows how component classes can be defined reproducibly from mixture data alone by agglomerative clustering, and demonstrates the benefit of these classes by combining them with a matrix completion method (MCM) for predicting isothermal activity coefficients at infinite dilution in binary mixtures.

Abstract

Predicting the thermodynamic properties of mixtures is crucial for process design and optimization in chemical engineering. Machine learning (ML) methods are gaining increasing attention in this field, but experimental data for training are often scarce, which hampers their application. In this work, we introduce a novel generic approach for improving data-driven models: inspired by the ancient rule "similia similibus solvuntur", we lump components that behave similarly into chemical classes and model them jointly in the first step of a hierarchical approach. While the information on class affiliations can stem in principle from any source, we demonstrate how classes can reproducibly be defined based on mixture data alone by agglomerative clustering. The information from this clustering step is then used as an informed prior for fitting the individual data. We demonstrate the benefits of this approach by applying it in connection with a matrix completion method (MCM) for predicting isothermal activity coefficients at infinite dilution in binary mixtures. Using clustering leads to significantly improved predictions compared to an MCM without clustering. Furthermore, the chemical classes learned from the clustering give exciting insights into what matters on the molecular level for modeling given mixture properties.
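
The clustering step can be illustrated with a minimal sketch. It assumes SciPy's hierarchical clustering routines and uses Euclidean distance with 'complete' linkage, the settings named in the Figure 4 caption. The toy matrix, the helper `cluster_rows`, and the choice of cutting each dendrogram into a fixed number of classes are assumptions made for illustration only; they are not the authors' data or procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Toy stand-in for a completed matrix of ln(gamma_ij^inf):
# 6 solutes (rows) x 4 solvents (columns). Rows 0-2 and rows 3-5 are
# constructed to behave alike, so two solute classes exist by design.
base = np.array([[0.0, 0.5, 2.0, 3.0],
                 [3.0, 2.0, 0.5, 0.0]])
ln_gamma = np.repeat(base, 3, axis=0) + 0.05 * rng.standard_normal((6, 4))


def cluster_rows(matrix, n_classes):
    """Agglomerative clustering of the rows of `matrix` (hypothetical helper).

    Builds a dendrogram with Euclidean distance and complete linkage,
    then cuts it so that at most `n_classes` flat clusters remain.
    """
    dendrogram = linkage(matrix, method="complete", metric="euclidean")
    return fcluster(dendrogram, t=n_classes, criterion="maxclust")


# Solutes are compared by their rows, solvents by their columns.
solute_classes = cluster_rows(ln_gamma, n_classes=2)
solvent_classes = cluster_rows(ln_gamma.T, n_classes=2)

print("solute classes: ", solute_classes)
print("solvent classes:", solvent_classes)
```

Cutting the dendrogram at a fixed number of clusters is a simplification. According to the Figure 4 caption, the classes in the paper were defined through visual analysis of the dendrograms.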

Paper Structure (13 sections, 6 equations, 6 figures)

Figures (6)

  • Figure 1: Scheme that illustrates how an MCM can be improved by clustering -- without any additional data. (1) A data-driven MCM is trained on the available experimental data on $\ln \gamma_{ij,\mathrm{exp}}^\infty$. (2) The obtained completed matrix of $\ln \gamma_{ij,\mathrm{pred}}^\infty$ is fed into the agglomerative clustering algorithm. (3) The resulting dendrograms are used to define component classes based on similarity regarding mixture behavior. (4) The class affiliations and experimental data are used in a hierarchical MCM, providing more precise predictions for $\ln \gamma_{ij}^\infty$.
  • Figure 2: Schematic representation of the hierarchical matrix completion method (hMCM). Class-specific LVs are modeled as variational distributions (grey) drawn from hyperprior distributions for solute (blue) and solvent (red) classes. The class-specific LVs are used to define conditional, class-specific priors, from which component-specific LVs are drawn, analogous to the training of sMCM. The likelihood (green) models how the component-specific LVs explain the training data $\ln \gamma_{ij,\mathrm{exp}}^\infty$ (green). After the training, the posterior distribution (green) for the final component-specific LVs is obtained. $\bm{A}_{r}$ and $\bm{B}_{s}$ denote vectors containing class-specific LVs, $\bm{u}_{i}$ and $\bm{v}_{j}$ denote vectors containing component-specific LVs. The variables are explained in the text.
  • Figure 3: Matrix of $\ln\gamma_{ij}^\infty$ in binary mixtures predicted by sMCM [jirasek2020machine]. Left: rows (solutes) and columns (solvents) are sorted according to the component identifier from the Dortmund Data Bank (DDB). Right: rows and columns are sorted according to similarities in $\ln\gamma_{ij}^\infty$, obtained by hierarchical agglomerative clustering. The color code indicates the values of $\ln\gamma_{ij}^\infty$. Two distinct blocks in the sorted matrix are marked and discussed in the text as examples.
  • Figure 4: Dendrograms of solutes (top) and solvents (bottom) resulting from the hierarchical clustering based on $\ln\gamma_{ij}^\infty$. The vertical axes represent the dissimilarity between clusters quantified by Euclidean distance and 'complete' linkage, while the horizontal axes show the individual components. Different colors indicate distinctions between clusters, and different color shades indicate the classes defined through visual analysis.
  • Figure 5: Mean absolute error (MAE) and mean squared error (MSE) of the developed hMCM for predicting $\ln\gamma_{ij}^\infty$ at 298.15 K in binary mixtures and comparison to the data-driven sMCM [jirasek2020machine] and the physical gold standard UNIFAC(Do) [modUNIFAC]. The complete horizon covers our complete data set; the UNIFAC horizon includes only the mixtures that UNIFAC(Do) can describe. Error bars indicate the standard errors of the means.
  • ...and 1 more figures
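
The hierarchy described in the Figure 2 caption can be written out as a generative sketch. The symbols $\bm{A}_{r}$, $\bm{B}_{s}$, $\bm{u}_{i}$, and $\bm{v}_{j}$ are those of the caption. Everything else is an assumption made for illustration and is not taken from the captions above: the Gaussian form of the hyperpriors and class-specific priors, the scale parameters $\sigma_0$ and $\sigma_1$, the class-assignment maps $r(i)$ and $s(j)$, and the dot product $\bm{u}_{i}^{\top}\bm{v}_{j}$ as the quantity that enters the likelihood.

```latex
% Hedged sketch of the hMCM hierarchy; distribution families are assumed.
\begin{align}
  \bm{A}_{r} &\sim \mathcal{N}\!\left(\bm{0},\, \sigma_0^2 \bm{I}\right)
    && \text{hyperprior, solute class } r \\
  \bm{B}_{s} &\sim \mathcal{N}\!\left(\bm{0},\, \sigma_0^2 \bm{I}\right)
    && \text{hyperprior, solvent class } s \\
  \bm{u}_{i} \mid \bm{A}_{r(i)} &\sim \mathcal{N}\!\left(\bm{A}_{r(i)},\, \sigma_1^2 \bm{I}\right)
    && \text{class-specific prior, solute } i \\
  \bm{v}_{j} \mid \bm{B}_{s(j)} &\sim \mathcal{N}\!\left(\bm{B}_{s(j)},\, \sigma_1^2 \bm{I}\right)
    && \text{class-specific prior, solvent } j \\
  \ln \gamma_{ij,\mathrm{exp}}^\infty \mid \bm{u}_{i}, \bm{v}_{j}
    &\sim p\!\left(\,\cdot \mid \bm{u}_{i}^{\top} \bm{v}_{j}\right)
    && \text{likelihood}
\end{align}
```

Under these assumptions, components in the same class share a prior mean, so a component with few measurements is pulled toward the behavior of its class, while a component with many measurements can still deviate from it through the likelihood.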