Benchmarking Simulacra AI's Quantum Accurate Synthetic Data Generation for Chemical Sciences

Fabio Falcioni; Elena Orlova; Timothy Heightman; Philip Mantrov; Aleksei Ustimenko

TL;DR

The paper addresses the cost and reliability bottlenecks of generating high-accuracy ab initio data for chemical discovery by combining Large Wavefunction Models with Variational Monte Carlo and a novel sampling scheme, RELAX. RELAX combines Riemannian Langevin dynamics, replica exchange, and a global replacement kernel to reduce autocorrelation and increase effective sample size, achieving an average cost reduction of about $28\times$ versus Microsoft’s Orbformer while keeping energy accuracy near the gold standard. In practice, this enables affordable, large-scale ab initio datasets for AI-driven drug design and materials discovery. Scaling also improves: the observed practical cost grows roughly as $N_{\text{electrons}}^{3.3}$, compared with $O(N_{\text{basis}}^{6\text{--}7})$ for CCSD and with the scaling of the baseline methods. Overall, the approach delivers significant efficiency gains and scalable, physics-grounded data labels for quantum-driven pharmaceutical and materials research.
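To make the scaling exponents above concrete, the following minimal sketch compares how cost grows when the system size doubles under a pure power law. It is illustrative arithmetic only: the function name `cost_ratio` is hypothetical, prefactors are ignored, and it assumes the basis-set size grows in proportion to the electron count, which the source does not state.

```python
def cost_ratio(size_factor: float, exponent: float) -> float:
    """Relative cost increase when system size grows by `size_factor`,
    assuming cost ~ N**exponent with a size-independent prefactor."""
    return size_factor ** exponent


if __name__ == "__main__":
    # Exponents taken from the text; the labels are for display only.
    exponents = {
        "observed practical cost (N_electrons^3.3)": 3.3,
        "CCSD, lower exponent (N_basis^6)": 6.0,
        "CCSD, upper exponent (N_basis^7)": 7.0,
    }
    for label, p in exponents.items():
        print(f"{label}: doubling size multiplies cost by {cost_ratio(2.0, p):.1f}")
```

Under these assumptions, doubling the system size multiplies the cost by just under 10 for the observed exponent, against 64 to 128 for the CCSD exponents.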

Abstract

In this work, we benchmark Simulacra AI's synthetic data generation pipeline against a state-of-the-art Microsoft pipeline on a dataset of small to large systems. Analyzing energy quality, autocorrelation times, and effective sample size, we find that Simulacra AI's Large Wavefunction Models (LWM) pipeline, paired with state-of-the-art Variational Monte Carlo (VMC) sampling algorithms, reduces data generation costs by 15-50x relative to the Microsoft pipeline while maintaining parity in energy accuracy, and by 2-3x relative to traditional CCSD methods at the scale of amino acids. This enables the creation of affordable, large-scale ab initio datasets, accelerating AI-driven optimization and discovery in the pharmaceutical industry and beyond. Our improvements are based on a novel and proprietary sampling scheme called Replica Exchange with Langevin Adaptive eXploration (RELAX).
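The abstract's two sampling diagnostics, autocorrelation time and effective sample size (ESS), can be sketched for a generic Markov chain as follows. This is a minimal illustration, not the authors' pipeline: it uses a standard FFT-based autocorrelation estimate with a self-consistent window, and the helper `ar1_chain` is a hypothetical toy generator standing in for a trace of VMC local energies.

```python
import numpy as np


def integrated_autocorr_time(x, c=5.0):
    """Estimate tau = 1 + 2 * sum_t rho(t), truncated with a self-consistent window."""
    x = np.asarray(x, dtype=float)
    n = x.size
    x = x - x.mean()
    # FFT-based autocovariance; zero-padding to 2n avoids circular wrap-around.
    f = np.fft.rfft(x, n=2 * n)
    acov = np.fft.irfft(f * np.conjugate(f), n=2 * n)[:n]
    rho = acov / acov[0]
    # tau[m] = 1 + 2 * sum_{t=1}^{m} rho(t)
    tau = 2.0 * np.cumsum(rho) - 1.0
    # Pick the smallest window m with m >= c * tau[m]; fall back to the full range.
    ok = np.arange(n) >= c * tau
    m = int(np.argmax(ok)) if ok.any() else n - 1
    return float(tau[m])


def effective_sample_size(x):
    """Number of independent samples the correlated chain is worth: n / tau."""
    return len(x) / integrated_autocorr_time(x)


def ar1_chain(phi, n, seed=0):
    """Toy correlated chain x[t] = phi * x[t-1] + noise (assumed stand-in for VMC output)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = noise[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x


if __name__ == "__main__":
    chain = ar1_chain(phi=0.9, n=100_000, seed=1)
    print("tau:", integrated_autocorr_time(chain))
    print("ESS:", effective_sample_size(chain))
```

For this toy process the exact autocorrelation time is $(1+\phi)/(1-\phi)$, so $\phi = 0.9$ gives 19 and the chain is worth roughly one nineteenth of its raw length. A sampler that lowers the autocorrelation time raises ESS per unit of compute, which is the kind of gain the abstract measures.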
