CHAOS - A Consistent Large-scale Database for Sigma-Profiles and Other Molecular Descriptors

Dominik Gond, Justus Arweiler, Thomas Specht, Hans Hasse, Fabian Jirasek

TL;DR

The paper addresses the fragmentation and inconsistency of publicly available sigma-profile data by introducing CHAOS, a large-scale, internally consistent database that links sigma-profiles to a broad set of quantum-chemical descriptors for 53,091 molecules. A uniform workflow (RDKit for geometry generation and conformer sampling, CREST for refinement, and Gaussian 16 at the ωB97X-D/def2-TZVP level for the high-level calculations) produces gas-phase structures, vibrational spectra, NMR shielding tensors, and C-PCM solvation data, together with precomputed COSMO-SAC-dsp sigma-profiles. The database comprises six descriptor blocks per molecule and also provides the raw COSMO segments, so users can construct custom sigma-profiles. The data are openly available under CC-BY-4.0. CHAOS thus offers a versatile resource for physics-based and data-driven modeling of thermophysical properties, solvent design, and materials science, and sets the stage for future expansion to condensed-phase and reactive systems.

Abstract

Sigma-profiles obtained from quantum-chemical calculations are key molecular descriptors for solvent selection, thermodynamic modeling, and data-driven molecular design. However, existing sigma-profile libraries are limited in size and inconsistent in quality, which restricts their utility. In this work, we introduce CHAOS (Computed High-Accuracy Observables and Sigma Profiles), a large-scale and internally consistent database providing sigma-profiles for 53,091 molecules, along with additional quantum-chemical observables including gas-phase geometries, single-point conductor-like polarizable continuum (C-PCM) data, infrared spectra, ideal-gas heat capacities and entropies, and atomic orbital nuclear magnetic resonance (NMR) shielding tensors. All data were generated using a standardized quantum-chemical workflow based on the ωB97X-D/def2-TZVP level of theory. The CHAOS database covers molecules composed of a diverse set of elements, with molar masses up to 400 amu and dipole moments up to 15 D, and is freely available on Zenodo under an open license. It extends the number of molecules for which sigma-profiles are publicly available by more than an order of magnitude and systematically links them to a broad range of other quantum-chemical molecular descriptors. CHAOS provides a comprehensive and consistent foundation for developing both physics-based and machine-learning models of molecular and thermodynamic properties across chemistry, chemical engineering, and materials science, and greatly extends the available quantum-chemical data basis.

Paper Structure

This paper contains 10 sections, 1 equation, 2 figures, 1 table.

Figures (2)

  • Figure 1: Schematic overview of the CHAOS data generation workflow. Starting from MOL files that define a non-optimized geometry, molecules undergo canonicalization and 3D conformer embedding via RDKit [rdkit] using the ETKDG algorithm [riniker2015better], followed by a geometry optimization based on the universal force field (UFF) [rappe1992uff] (top). The lowest-energy structure is subsequently refined via a semi-empirical conformer search using CREST [pracht2020automated] with GFN2-xTB [bannwarth2019gfn2] in a narrow energy window (middle). The final high-level geometry optimization is performed using Gaussian 16 [g16] at the $\omega$B97X-D/def2-TZVP [chai2008systematic, weigend2005balanced] level of density functional theory (DFT) [kohn1965self]. Based on this geometry, subsequent calculations are performed, including harmonic frequency analysis, gauge-including atomic orbital (GIAO) nuclear magnetic resonance (NMR) shielding computations [ruud1993hartree], and single-point conductor-like polarizable continuum (C-PCM) calculations [barone1998quantum, cossi2003energies] (bottom). The resulting data are further post-processed to derive molecular descriptors, such as the $\sigma$-profile. All results are compiled into structured JSON records.
  • Figure 2: Overview of the diversity of molecules included in the CHAOS database. Panels A) and B) show the distributions of the molar masses and dipole moments, respectively. Panel C) gives the relative numbers of molecules containing specific heteroatoms (light green bars) together with the relative number of molecules containing other heteroatoms (dark green bar). Additionally, the number of hydrocarbons is reported (grey bar), as well as the number of molecules containing no carbon (orange bar). The number of molecules in a given class is reported as a fraction of the total number of molecules $N_\mathrm{tot} =$ 53 091 in Panels A) to C). Panel D) shows the distribution of pair-wise Tanimoto similarities [bajusz2015tanimoto] (ECFP4 [rogers2010extended, morgan1965generation], 2048 bits) calculated for a random sample of $10^7$ molecule pairs from CHAOS; 99.80% of the reported similarities are within the shown range.
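
The pair-wise similarity analysis described for Panel D) of Figure 2 can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy fingerprints, and the convention of returning 0.0 for two empty fingerprints are assumptions made for illustration. Each fingerprint is represented as the set of its "on" bit indices (0 to 2047 for a 2048-bit vector). In practice, the ECFP4 fingerprints would be generated with a cheminformatics toolkit rather than written by hand.

```python
import random


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices.

    T(a, b) = |a AND b| / |a OR b|
    """
    union = len(a | b)
    if union == 0:
        # Assumption for this sketch: two empty fingerprints get similarity 0.0.
        return 0.0
    return len(a & b) / union


def sample_pair_similarities(fingerprints, n_pairs, seed=0):
    """Tanimoto similarities for n_pairs randomly drawn pairs of distinct molecules."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    indices = range(len(fingerprints))
    similarities = []
    for _ in range(n_pairs):
        i, j = rng.sample(indices, 2)  # two different molecule indices
        similarities.append(tanimoto(fingerprints[i], fingerprints[j]))
    return similarities


# Hypothetical toy fingerprints (on-bit indices of a 2048-bit vector).
toy_fingerprints = [
    frozenset({1, 5, 9}),
    frozenset({1, 5, 42}),
    frozenset({7, 2047}),
    frozenset({1, 9}),
]

similarities = sample_pair_similarities(toy_fingerprints, n_pairs=1000)
```

Sampling pairs at random, rather than evaluating every pair, keeps the cost manageable: 53 091 molecules give roughly $1.4 \times 10^9$ distinct pairs, whereas the caption reports a sample of $10^7$.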