
Generative Chemical Language Models for Energetic Materials Discovery

Andrew Salij, R. Seaton Ullberg, Megan C. Davis, Marc J. Cawkwell, Christopher J. Snyder, Cristina Garcia Cardona, Ivana Matanovic, Wilton J. M. Kort-Kamp

Abstract

The discovery of new energetic materials remains a pressing challenge, hindered by the limited availability of high-quality data. To address this, we have developed generative molecular language models that are pretrained on extensive chemical data and then fine-tuned on curated energetic materials datasets. This transfer-learning strategy extends chemical language model capabilities beyond the pharmacological space in which they have predominantly been developed, offering a framework applicable to other data-sparse discovery problems. Furthermore, we discuss the benefits of fragment-based molecular encodings for chemical language models, particularly for constructing synthetically accessible structures. Together, these advances provide a foundation for accelerating the design of next-generation energetic materials with demanding performance requirements.
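The pretrain-then-fine-tune recipe summarized above can be illustrated with off-the-shelf tooling. The following is a minimal sketch, not the authors' pipeline: it assumes a generic GPT-2 checkpoint from Hugging Face `transformers` as a stand-in for a chemistry-pretrained model, and a hypothetical file `energetics.selfies` containing one (Group) SELFIES-encoded molecule per line (e.g., produced with the `selfies` package's `encoder`). Model names, file names, and hyperparameters here are assumptions for illustration only.

```python
# Minimal transfer-learning sketch (illustrative; not the paper's actual code).
# Fine-tunes a pretrained causal language model on a small energetics corpus.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "gpt2"  # stand-in for a chemistry-pretrained GPT checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Hypothetical fine-tuning corpus: one SELFIES string per line.
dataset = load_dataset("text", data_files={"train": "energetics.selfies"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out",
                           num_train_epochs=10,
                           per_device_train_batch_size=32,
                           learning_rate=5e-5),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()

# Sample an unconditioned molecule from the fine-tuned model.
ids = model.generate(do_sample=True, top_k=50, max_length=64,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

In practice, a tokenizer whose vocabulary matches the SELFIES (or Group SELFIES) token set would replace the generic byte-pair tokenizer used in this sketch.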

Paper Structure

This paper contains 25 sections, 3 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: a) Training pipeline for GPT models, staged into pretraining, which yields a wide variety of molecules, and fine-tuning, which yields many C-, N-, and O-containing compounds. b) Scheme of the GPT model architecture and data processing.
  • Figure 2: Synthetic accessibility (SA) score [ertl2009estimation] distributions for unconditioned molecular outputs of pretrained $\chi$hem- and fine-tuned X-GPT models against a) the number of heavy atoms generated and b) detonation velocities predicted via a ChemProp [heid2023chemprop] surrogate. All subfigures are normalized such that the highest histogram bin is 1.
  • Figure 3: Comparison of detonation velocities and pressures estimated from the Kamlet-Jacobs equations [kamlet1968chemistry] for unconditioned (left column) and conditioned (right column) generation by fine-tuned GroupSELFIES models, compared to the base model. All subfigures have been normalized to an identical maximum value.
  • Figure 4: Distributions of a) the number of nitrogen-oxygen bonds, b) the number of nitrogen-nitrogen bonds, c) the quantitative estimate of drug-likeness (QED), and d) the synthetic accessibility score (SA score) for molecules generated by large SELFIES-based GPT models.
  • Figure 5: Common substructures of generated output from chemical language models, manually selected for diversity. Substructures were chosen as representative samples from the 200 most common subgraphs of size $5-10$, obtained via RDKit [rdkitsoftware].
  • ...and 13 more figures
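For the screening quantities referenced in the Figure 2 and Figure 3 captions above, the sketch below shows one way such quantities are commonly computed: the Ertl-Schuffenhauer synthetic accessibility score via RDKit's contributed SA_Score module [ertl2009estimation], and detonation velocity and pressure via a common form of the Kamlet-Jacobs equations [kamlet1968chemistry], $D = 1.01\,\Phi^{1/2}(1 + 1.30\,\rho_0)$ and $P = 1.558\,\rho_0^2\,\Phi$ with $\Phi = N \bar{M}^{1/2} Q^{1/2}$. This is a minimal sketch, not the paper's surrogate models, and the numeric inputs are illustrative placeholders rather than values from the paper.

```python
# Illustrative screening helpers (not the paper's surrogate models).
import math
import os
import sys

from rdkit import Chem
from rdkit.Chem import RDConfig

# RDKit ships the Ertl-Schuffenhauer SA score as a contrib module.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer

def sa_score(smiles: str) -> float:
    """Synthetic accessibility score: roughly 1 (easy) to 10 (hard)."""
    return sascorer.calculateScore(Chem.MolFromSmiles(smiles))

def kamlet_jacobs(N: float, M: float, Q: float, rho: float):
    """Kamlet-Jacobs estimates of detonation performance.

    N:   mol of gaseous detonation products per gram of explosive
    M:   mean molecular weight of those gases (g/mol)
    Q:   heat of detonation (cal/g)
    rho: loading density (g/cm^3)
    Returns (D in km/s, P in GPa).
    """
    phi = N * math.sqrt(M) * math.sqrt(Q)
    D = 1.01 * math.sqrt(phi) * (1.0 + 1.30 * rho)
    P = 1.558 * rho**2 * phi
    return D, P

# TNT as a familiar test case; N, M, Q, rho are rough illustrative values.
tnt = "Cc1c(cc(cc1[N+](=O)[O-])[N+](=O)[O-])[N+](=O)[O-]"
print(f"SA score: {sa_score(tnt):.2f}")
D, P = kamlet_jacobs(N=0.025, M=28.0, Q=1090.0, rho=1.64)
print(f"D = {D:.2f} km/s, P = {P:.1f} GPa")
```

A learned surrogate such as the ChemProp model mentioned in the Figure 2 caption would replace the closed-form Kamlet-Jacobs estimate when trained property data are available.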