The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning

Toby Boyne, Juan S. Campos, Becky D. Langdon, Jixiang Qing, Yilin Xie, Shiqiang Zhang, Calvin Tsay, Ruth Misener, Daniel W. Davies, Kim E. Jelfs, Sarah Boyall, Thomas M. Dixon, Linden Schrecker, Jose Pablo Folch

TL;DR

The paper presents the first ML-ready transient-flow dataset for chemistry, enabling benchmarking of yield prediction under continuous reaction conditions with emphasis on solvent effects. It introduces a solvent-ramping data collection framework, open data on Kaggle, and a suite of ML benchmarks (solvent featurization, Gaussian process (GP) extensions, transfer learning, and active learning) to address low-data and dynamic regimes. Key findings show that solvent-aware featurization (Spange) and GP-based methods offer strong baselines, while mixture representations and non-stationary modeling require further development; active learning and multi-objective Bayesian optimization (MOBO) demonstrate efficient experimental design. The work highlights practical implications for solvent replacement and sustainable manufacturing, and calls for broader data resources and priors to advance chemistry-aware ML models.

Abstract

Machine learning has promised to change the landscape of laboratory chemistry, with impressive results in molecular property prediction and reaction retro-synthesis. However, chemical datasets are often inaccessible to the machine learning community as they tend to require cleaning, thorough understanding of the chemistry, or are simply not available. In this paper, we introduce a novel dataset for yield prediction, providing the first-ever transient flow dataset for machine learning benchmarking, covering over 1200 process conditions. While previous datasets focus on discrete parameters, our experimental set-up allows us to sample a large number of continuous process conditions, generating new challenges for machine learning models. We focus on solvent selection, a task that is particularly difficult to model theoretically and therefore ripe for machine learning applications. We showcase benchmarking for regression algorithms, transfer-learning approaches, feature engineering, and active learning, with important applications towards solvent replacement and sustainable manufacturing.

Paper Structure

This paper contains 30 sections, 16 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Data was gathered on the rearrangement of allyl-substituted catechol. By subjecting the reaction mixture to high temperatures, we begin a cascade reaction forming multiple rearrangement products. We investigate the yield of the reaction for a range of different solvents. Product 1 was not observed, as it reacted immediately to form Product 2 and later Product 3.
  • Figure 2: Visual summary of the data set. (a) Showcases the solvent space covered. (b) A full 8h experimental run between two solvents. (c) A residence time ramp, showing the starting material and product yields. (d) A solvent ramp, showing the yields under solvent mixture conditions.
  • Figure 3: Example of a residence-time ramp in a transient flow reactor. (Left) We decrease the flow rate of the reactor to begin the experiments. (Middle) The residence time experienced by the flow at the point of measurement. (Right) Product yield mapped against residence time of measurements.
  • Figure 4: GP prediction on the yields of a solvent ramp, using Spange descriptors. We showcase a comparison between the baseline Gaussian process and a standard one. 2-MeTHF appears in another ramp, and so the model is confident about its predictions; as the proportion of Ether increases, so too does the model uncertainty.
  • Figure 5: Results of benchmarking for active learning and BO. We initialize the GP with 5 random samples, and show results over 30 initializations. We report the median, 10th and 90th quantiles.
  • ...and 3 more figures