QuantumChem-200K: A Large-Scale Open Organic Molecular Dataset for Quantum-Chemistry Property Screening and Language Model Benchmarking

Yinqi Zeng, Renjie Li

TL;DR

This work fills a critical gap in photoinitiator discovery by introducing QuantumChem-200K, the first large open dataset linking molecular structure to 11 quantum-chemical and photophysical properties via a hybrid DFT/semi-empirical workflow. The authors demonstrate that domain-specific fine-tuning of a chemistry-oriented LLM on QuantumChem-200K yields substantial gains in forward property prediction from SMILES, outperforming several baselines on unseen molecules and key photophysical targets like sigma780 and ISC. The dataset, alongside an evaluation benchmark and an AI assistant, enables scalable, high-throughput, AI-assisted photoinitiator screening and autonomous materials discovery, with clear pathways for future extension into excited-state dynamics and multi-modal data fusion.

Abstract

The discovery of next-generation photoinitiators for two-photon polymerization (TPP) is hindered by the absence of large, open datasets containing the quantum-chemical and photophysical properties required to model photodissociation and excited-state behavior. Existing molecular datasets typically provide only basic physicochemical descriptors and therefore cannot support data-driven screening or AI-assisted design of photoinitiators. To address this gap, we introduce QuantumChem-200K, a large-scale dataset of over 200,000 organic molecules annotated with eleven quantum-chemical properties, including two-photon absorption (TPA) cross sections, TPA spectral ranges, singlet-triplet intersystem crossing (ISC) energies, toxicity and synthetic accessibility scores, hydrophilicity, solubility, boiling point, molecular weight, and aromaticity. These values are computed using a hybrid workflow that integrates density functional theory (DFT), semi-empirical excited-state methods, atomistic quantum solvers, and neural-network predictors. Using QuantumChem-200K, we fine-tune the open-source Qwen2.5-32B large language model to create a chemistry AI assistant capable of forward property prediction from SMILES. Benchmarking on 3,000 unseen molecules from VQM24 and ZINC20 demonstrates that domain-specific fine-tuning significantly improves accuracy over GPT-4o, Llama-3.1-70B, and the base Qwen2.5-32B model, particularly for the TPA and ISC predictions central to photoinitiator design. QuantumChem-200K and the corresponding AI assistant together provide the first scalable platform for high-throughput, LLM-driven photoinitiator screening and accelerated discovery of photosensitive materials.

Paper Structure

This paper contains 19 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Schematic of direct laser writing (DLW). A focused photon beam is confined to a single voxel within the photo-resin. At this focal point, the photoinitiator absorbs light, undergoes dissociation, and generates reactive radicals. These radicals initiate polymerization with nearby monomers in the resin, enabling the formation of solid structures with nanoscale precision.
  • Figure 2: General workflow for the QuantumChem-200K dataset curation
  • Figure 3: wMAE of the AI assistant (Qwen2.5-32B) for each property during fine-tuning on the QuantumChem-200K dataset, where orange and blue are the wMAE at 3 and 6 epochs of training, respectively. The number on top of each bar is the per-property contribution to the overall wMAE. The wMAE here is calculated with 100 randomly sampled data points.
  • Figure 4: Final AI assistant evaluation with wMAE on the 3,000-molecule test bank, showing an overall wMAE of 0.1975.
  • Figure 5: Ranking of the wMAE of the AI assistant (orange), Llama-3.1-70B (blue), Qwen2.5-32B (gray), and GPT-4o (pink) on the 3,000-molecule test bank for each property, where the number on top of each bar is the ranking. The overall wMAE value of each model is recorded in Table 3.
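
The figure captions refer to an overall wMAE built from per-property contributions, but this excerpt does not give the formula. The sketch below shows one plausible reading: each property's mean absolute error is divided by a scale (so properties with different units become comparable), multiplied by a normalized weight, and the resulting contributions are summed. The function name `weighted_mae`, the `weights` and `scales` arguments, and the normalization itself are assumptions made for illustration and may differ from the definition used in the paper.

```python
from typing import Dict, List, Tuple


def weighted_mae(
    y_true: Dict[str, List[float]],
    y_pred: Dict[str, List[float]],
    weights: Dict[str, float],
    scales: Dict[str, float],
) -> Tuple[float, Dict[str, float]]:
    """Return (overall wMAE, per-property contributions).

    Assumed form (not taken from the paper):
        contribution_p = (w_p / sum of w) * MAE_p / scale_p
        wMAE           = sum of contribution_p over properties
    """
    # Normalize weights over the properties that are actually evaluated.
    total_weight = sum(weights[prop] for prop in y_true)
    contributions: Dict[str, float] = {}
    for prop, truth in y_true.items():
        pred = y_pred[prop]
        if not truth or len(truth) != len(pred):
            raise ValueError(f"mismatched or empty values for {prop!r}")
        # Plain mean absolute error for this property.
        mae = sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)
        # Scale to a unitless error, then apply the normalized weight.
        contributions[prop] = (weights[prop] / total_weight) * (mae / scales[prop])
    return sum(contributions.values()), contributions
```

Under this reading, the per-property numbers printed above the bars in Figure 3 would correspond to the entries of `contributions`, and their sum would be the overall value reported for each model.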