CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction

Ella Miray Rajaonson, Mahyar Rajabi Kochi, Luis Martin Mejia Mendoza, Seyed Mohamad Moosavi, Benjamin Sanchez-Lengeling

TL;DR

CheMixHub presents the first large-scale, unified benchmark for chemical mixture property prediction, aggregating 11 tasks from 7 datasets to enable robust evaluation of models across diverse multi-component systems. It defines a three-level modeling space (molecular representation, mixture interactions, and output generation) and introduces multiple data splits to test generalization, including mixture-size, leave-molecule-out, and temperature-based splits. Key findings show that physics-informed Arrhenius heads improve temperature-dependent predictions and that pre-trained representations (e.g., MolT5) often outperform traditional GNNs and descriptors, though extrapolation to new chemistries remains challenging. The work provides open-source data, code, and a framework to foster progress in mixture-aware modeling and formulation optimization, with clear limitations and directions for future research.
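As a rough illustration of what a physics-informed Arrhenius head could look like, the sketch below has the network predict two temperature-independent quantities, a log pre-factor and an activation energy, and then obtains the temperature-dependent output from the standard Arrhenius relation ln y = ln A - E_a / (R T). The function name, argument names, and sign convention are assumptions made for illustration; the benchmark's own implementation may differ (for example, viscosity-like properties are often written with the opposite sign on the exponent).

```python
# Gas constant in J / (mol * K).
R = 8.314462618


def arrhenius_head(ln_a, e_a, temperature_k):
    """Hypothetical output head: ln(y) = ln(A) - E_a / (R * T).

    ln_a          -- predicted log pre-exponential factor (dimensionless)
    e_a           -- predicted activation energy in J/mol
    temperature_k -- absolute temperature in kelvin
    """
    if temperature_k <= 0:
        raise ValueError("temperature must be a positive absolute temperature")
    return ln_a - e_a / (R * temperature_k)


# The same (ln_a, e_a) pair yields a prediction at any temperature,
# which is what ties predictions across temperatures together.
low_t = arrhenius_head(ln_a=0.0, e_a=20_000.0, temperature_k=300.0)
high_t = arrhenius_head(ln_a=0.0, e_a=20_000.0, temperature_k=350.0)
```

Because temperature enters only through the fixed functional form, the learned parameters are shared across all temperatures of a mixture, which is one plausible reason such a head could help on temperature-based splits.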

Abstract

Developing improved predictive models for multi-molecular systems is crucial, as nearly every chemical product in use is a mixture of chemicals. Although mixtures are a vital part of the industry pipeline, the chemical mixture space remains relatively unexplored by the machine learning community. In this paper, we introduce CheMixHub, a holistic benchmark for molecular mixtures, covering a corpus of 11 chemical mixture property prediction tasks, from drug delivery formulations to battery electrolytes, totalling approximately 500k data points gathered and curated from 7 publicly available datasets. CheMixHub introduces various data splitting techniques to assess context-specific generalization and model robustness, providing a foundation for the development of predictive models for chemical mixture properties. Furthermore, we map out the modelling space of deep learning models for chemical mixtures, establishing initial benchmarks for the community. This dataset has the potential to accelerate chemical mixture development, encompassing reformulation, optimization, and discovery. The dataset and code for the benchmarks can be found at: https://github.com/chemcognition-lab/chemixhub
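To make the idea of context-specific splits concrete, here is a minimal sketch of two of the split types named in the summary: a leave-molecule-out split and a mixture-size split. It assumes each mixture is represented as a tuple of component identifiers (SMILES strings here) and uses the plain component count as the mixture size. The function names and the toy data are hypothetical and are not taken from the CheMixHub code base.

```python
def leave_molecules_out_split(mixtures, held_out):
    """Send every mixture containing a held-out molecule to the test set."""
    held_out = set(held_out)
    train_idx, test_idx = [], []
    for i, components in enumerate(mixtures):
        if held_out.intersection(components):
            test_idx.append(i)   # contains at least one unseen molecule
        else:
            train_idx.append(i)  # built only from training molecules
    return train_idx, test_idx


def mixture_size_split(mixtures, threshold):
    """Train on mixtures with fewer than `threshold` components, test on the rest."""
    train_idx = [i for i, c in enumerate(mixtures) if len(c) < threshold]
    test_idx = [i for i, c in enumerate(mixtures) if len(c) >= threshold]
    return train_idx, test_idx


# Toy mixtures, for illustration only.
mixtures = [
    ("CCO", "O"),
    ("CCO", "CC(=O)O"),
    ("O", "CO", "CCO"),
    ("CO",),
]

lmo_train, lmo_test = leave_molecules_out_split(mixtures, held_out={"CO"})
size_train, size_test = mixture_size_split(mixtures, threshold=3)
```

In the leave-molecule-out case no training mixture contains a held-out molecule, so test performance reflects extrapolation to new chemistry rather than interpolation between seen components.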

Paper Structure

This paper contains 44 sections, 19 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: CheMixHub: A benchmark for chemical mixture property prediction. (Left) Illustrates a sample mixture input, including components and conditions. (Center) Highlights potential applications enabled by CheMixHub, such as reformulation, optimization, and discovery through property prediction. (Right) Summarizes CheMixHub's structure: 11 tasks, 4 data split types, and multi-level modeling baselines for comprehensive evaluation and development.
  • Figure 2: Diversity of Chemical Structures and Mixture Compositions in CheMixHub. (Left) t-SNE visualization of the molecular structural diversity, with points colored by their source dataset. (Right) Histogram showing the percentage of mixtures based on their number of components.
  • Figure 3: Mapping out the deep learning modeling space for chemical mixtures. We highlight three levels: (1) molecular representation and context infusion (e.g., molecular fraction), (2) mixture-level interaction aggregation, and infusion of global mixture context (e.g., temperature), (3) property output generation, each offering distinct avenues for model development.
  • Figure 4: Generalization to new mixture sizes and molecules. For each dataset: (Left) Ablation study in which the training data contains only mixtures whose (geometric) average number of molecules is below a threshold. The thresholds are indicated for each split. (Right) Boxplot of the test Pearson correlation of the best deep learning model on random CV splits and on leave-molecule-out (LMO) splits.
  • Figure 5: The embedding space of salts and fragments in CheMixHub. UMAP projection of the combined RDKit 2D descriptor space (200 dimensions) for salts and fragments. The embedding reveals well-defined structural clusters with apparent separation between salts and fragments, rather than overlap. Most salts appear in peripheral regions relative to the fragment clusters, suggesting distinct structural patterns at the descriptor level.
  • ...and 1 more figure
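The three modeling levels described in the Figure 3 caption can be sketched with a deliberately simple example: per-molecule embeddings are weighted by their fractions (level 1), summed into a permutation-invariant mixture vector to which a global condition such as temperature is appended (level 2), and the result would then be passed to an output head (level 3, not shown). Fraction-weighted sum pooling is only one possible aggregation chosen here for brevity; the function name and inputs are hypothetical and do not reproduce the paper's architectures.

```python
def pool_mixture(embeddings, fractions, temperature_k):
    """Fraction-weighted sum of molecule embeddings plus a temperature feature."""
    if len(embeddings) != len(fractions):
        raise ValueError("need exactly one fraction per molecule embedding")
    total = sum(fractions)
    if total <= 0:
        raise ValueError("fractions must sum to a positive value")

    dim = len(embeddings[0])
    pooled = [0.0] * dim
    for emb, frac in zip(embeddings, fractions):
        weight = frac / total  # normalise so the weights sum to one
        for j in range(dim):
            pooled[j] += weight * emb[j]

    # Append the global mixture context (temperature) as an extra feature.
    return pooled + [temperature_k]


# Two-component toy mixture with 2-dimensional placeholder embeddings.
features = pool_mixture(
    embeddings=[[1.0, 0.0], [0.0, 1.0]],
    fractions=[0.25, 0.75],
    temperature_k=298.15,
)
```

Summation makes the result independent of the order in which components are listed, which is a property any mixture-level aggregation needs, since a mixture has no canonical component order.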