QuantumChem-200K: A Large-Scale Open Organic Molecular Dataset for Quantum-Chemistry Property Screening and Language Model Benchmarking
Yinqi Zeng, Renjie Li
TL;DR
This work addresses a gap in photoinitiator discovery by introducing QuantumChem-200K, the first large open dataset linking molecular structure to 11 quantum-chemical and photophysical properties, computed with a hybrid DFT/semi-empirical workflow. The authors show that fine-tuning the open-source Qwen2.5-32B LLM on QuantumChem-200K substantially improves forward property prediction from SMILES: on unseen molecules it outperforms GPT-4o, Llama-3.1-70B, and the base Qwen2.5-32B model, particularly on key photophysical targets such as sigma780 and ISC. Together with an evaluation benchmark and an AI assistant, the dataset supports scalable, high-throughput, AI-assisted photoinitiator screening, with future extensions planned toward excited-state dynamics and multi-modal data fusion.
Abstract
The discovery of next-generation photoinitiators for two-photon polymerization (TPP) is hindered by the absence of large, open datasets containing the quantum-chemical and photophysical properties required to model photodissociation and excited-state behavior. Existing molecular datasets typically provide only basic physicochemical descriptors and therefore cannot support data-driven screening or AI-assisted design of photoinitiators. To address this gap, we introduce QuantumChem-200K, a large-scale dataset of over 200,000 organic molecules annotated with eleven quantum-chemical properties, including two-photon absorption (TPA) cross sections, TPA spectral ranges, singlet-triplet intersystem crossing (ISC) energies, toxicity and synthetic accessibility scores, hydrophilicity, solubility, boiling point, molecular weight, and aromaticity. These values are computed using a hybrid workflow that integrates density functional theory (DFT), semi-empirical excited-state methods, atomistic quantum solvers, and neural-network predictors. Using QuantumChem-200K, we fine-tune the open-source Qwen2.5-32B large language model to create a chemistry AI assistant capable of forward property prediction from SMILES. Benchmarking on 3,000 unseen molecules from VQM24 and ZINC20 demonstrates that domain-specific fine-tuning significantly improves accuracy over GPT-4o, Llama-3.1-70B, and the base Qwen2.5-32B model, particularly for the TPA and ISC predictions central to photoinitiator design. Together, QuantumChem-200K and the corresponding AI assistant provide the first scalable platform for high-throughput, LLM-driven photoinitiator screening and accelerated discovery of photosensitive materials.
