Benchmarking Simulacra AI's Quantum Accurate Synthetic Data Generation for Chemical Sciences

Fabio Falcioni, Elena Orlova, Timothy Heightman, Philip Mantrov, Aleksei Ustimenko

TL;DR

The paper addresses the cost and reliability bottlenecks of generating high-accuracy ab initio data for chemical discovery by combining Large Wavefunction Models with Variational Monte Carlo and a novel RELAX sampling scheme. RELAX merges Riemannian Langevin dynamics, replica exchange, and a global replacement kernel to drastically reduce autocorrelation and increase effective sample size, achieving an average cost reduction of about $28\times$ versus Microsoft’s Orbformer while maintaining energy accuracy near the gold standard. Practically, this enables affordable, large-scale ab initio datasets for AI-driven drug design and materials discovery, with improved scaling: the observed practical cost scales as roughly $N_{\text{electrons}}^{3.3}$, compared with $O(N_{\text{basis}}^{6\text{--}7})$ for CCSD and steeper scaling for the baseline methods. Overall, the approach delivers significant efficiency gains and scalable, physics-grounded data labels for next-generation quantum-driven pharmaceutical and materials research.

Abstract

In this work, we benchmark Simulacra AI's synthetic data generation pipeline against a state-of-the-art Microsoft pipeline on a dataset of small to large systems. By analyzing energy quality, autocorrelation times, and effective sample size, we show that Simulacra AI's Large Wavefunction Models (LWM) pipeline, paired with state-of-the-art Variational Monte Carlo (VMC) sampling algorithms, reduces data generation costs by 15-50x relative to the Microsoft pipeline while maintaining parity in energy accuracy, and by 2-3x relative to traditional CCSD methods for systems on the scale of amino acids. This enables the creation of affordable, large-scale ab initio datasets, accelerating AI-driven optimization and discovery in the pharmaceutical industry and beyond. Our improvements are based on a novel and proprietary sampling scheme called Replica Exchange with Langevin Adaptive eXploration (RELAX).

Paper Structure

This paper contains 21 sections, 25 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Autocorrelation time $\tau$ and autocorrelation value for different systems. Lower values are better. Simulacra AI maintains $\tau < 10$ and autocorrelation $\sim$0.7-0.8, producing largely independent samples, while the Microsoft pipeline scales poorly, with $\tau > 100$ and autocorrelation $\sim$0.95-1.0, indicating highly correlated samples that provide minimal additional statistical information.
  • Figure 2: Effective sample size (ESS) for $10^4$ iterations with 1024 walkers. ESS quantifies the number of statistically independent samples produced; higher values indicate better sampling efficiency. Simulacra AI's RELAX (blue) maintains an ESS of $\sim 1.2$-$6 \times 10^6$ across all system sizes, while standard MALA (red) yields far fewer ($<10^5$) independent samples for large systems due to severe autocorrelation.
  • Figure 3: Cost per system (in USD) as a function of the number of atoms and the number of electrons. Simulacra AI's pipeline achieves a 15-50$\times$ cost reduction relative to Microsoft's across all system sizes, with improved power-law scaling ($N_{\text{atoms}}^{4.4}$ vs. $N_{\text{atoms}}^{4.9}$; $N_{\text{electrons}}^{3.3}$ vs. $N_{\text{electrons}}^{3.7}$). Both VMC approaches significantly outperform CCSD methods for systems with $\ge 10$ atoms, with cost advantages increasing for larger molecules.
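
The autocorrelation time in Figure 1 and the effective sample size in Figure 2 are linked by a standard relation, ESS $\approx N / \tau$, where $N$ is the total number of samples. With 1024 walkers and $10^4$ iterations, $N \approx 1.02 \times 10^7$, so $\tau < 10$ implies an ESS above $10^6$, which matches the range quoted for RELAX. The sketch below illustrates that relation on a synthetic chain. It is not the paper's estimator, which this summary does not specify: the truncation rule (stop at the first non-positive autocorrelation), the assumption that walkers are independent, and the helper names `ar1_chain`, `integrated_autocorr_time`, and `effective_sample_size` are all assumptions made for illustration.

```python
import numpy as np


def autocorrelation(x, max_lag):
    """Normalized autocorrelation rho[k] of a 1-D series for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    return np.array([np.dot(x[: n - k], x[k:]) / (n * var) for k in range(max_lag + 1)])


def integrated_autocorr_time(x, max_lag=200):
    """Estimate tau = 1 + 2 * sum_k rho[k].

    Assumption: the sum is truncated at the first non-positive rho[k],
    a simple heuristic to keep noise in the tail out of the estimate.
    """
    max_lag = min(max_lag, len(x) - 1)
    rho = autocorrelation(x, max_lag)
    tau = 1.0
    for k in range(1, max_lag + 1):
        if rho[k] <= 0.0:
            break
        tau += 2.0 * rho[k]
    return tau


def effective_sample_size(chains, max_lag=200):
    """ESS for an array of shape (n_walkers, n_steps).

    Assumption: walkers are treated as independent, so the per-walker
    values n_steps / tau are simply summed.
    """
    chains = np.asarray(chains, dtype=float)
    n_steps = chains.shape[1]
    return sum(n_steps / integrated_autocorr_time(c, max_lag) for c in chains)


def ar1_chain(phi, n_steps, rng):
    """Hypothetical test signal: a unit-variance AR(1) process.

    Its exact integrated autocorrelation time is (1 + phi) / (1 - phi).
    """
    x = np.empty(n_steps)
    x[0] = rng.standard_normal()
    scale = np.sqrt(1.0 - phi**2)
    for t in range(1, n_steps):
        x[t] = phi * x[t - 1] + scale * rng.standard_normal()
    return x
```

Applied to per-walker local-energy traces, `effective_sample_size` would return a number comparable in spirit to the values plotted in Figure 2, though the exact figures depend on the estimator and truncation rule actually used.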