SHA-256 Infused Embedding-Driven Generative Modeling of High-Energy Molecules in Low-Data Regimes

Siddharth Verma, Alankar Alankar

TL;DR

This work addresses rapid discovery of high-energy materials in a data-scarce regime by coupling a lightweight LSTM-based SMILES generator, whose embedding combines fixed SHA-256-derived vectors with a partially trainable component, with an AttentiveFP graph neural network for multi-target property prediction. The method enables data-efficient exploration on commodity hardware, achieving 67.5% validity and 37.5% novelty in generation, and identifying 37 candidates with predicted detonation velocity above 9 km/s from a public 303-molecule dataset. The SHA-256 embedding provides a strong inductive bias that improves generalization, reduces memory usage, and avoids pretraining, while AttentiveFP delivers robust property predictions across nine energetic descriptors. This approach yields thousands of novel energetic candidates and reveals design motifs (e.g., azole cores with nitrate ester and nitramine substitutions) that balance performance with synthetic feasibility, offering a practical path for low-resource discovery and highlighting ethical considerations for dual-use applications.
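
The summary does not spell out how the SHA-256 vectors are built, so the following is only a minimal sketch of one plausible construction, not the authors' implementation. It assumes each SMILES token is hashed and the digest bytes are rescaled to [-1, 1] to form a frozen vector, which is then concatenated with a small trainable vector. The names `sha256_fixed_vector` and `HybridEmbedding`, the dimensions, and the byte-to-float mapping are all hypothetical.

```python
import hashlib
import random


def sha256_fixed_vector(token, dim=32):
    """Deterministically map a token to `dim` floats in [-1, 1] via SHA-256.

    Assumption: bytes of the digest are rescaled linearly; the paper may
    use a different mapping.
    """
    values = []
    counter = 0
    while len(values) < dim:
        # Hash the token plus a counter so that dim > 32 gets fresh bytes.
        digest = hashlib.sha256(f"{token}|{counter}".encode("utf-8")).digest()
        values.extend(b / 127.5 - 1.0 for b in digest)
        counter += 1
    return values[:dim]


class HybridEmbedding:
    """Concatenate a frozen hash-derived vector with a trainable vector."""

    def __init__(self, vocab, fixed_dim=32, trainable_dim=8, seed=0):
        rng = random.Random(seed)
        # Fixed part: never updated during training.
        self.fixed = {t: sha256_fixed_vector(t, fixed_dim) for t in vocab}
        # Trainable part: small random init; an optimizer would update these.
        self.trainable = {
            t: [rng.uniform(-0.1, 0.1) for _ in range(trainable_dim)]
            for t in vocab
        }

    def __call__(self, token):
        return self.fixed[token] + self.trainable[token]


if __name__ == "__main__":
    emb = HybridEmbedding(["C", "N", "O", "(", ")", "="])
    print(len(emb("N")))  # fixed_dim + trainable_dim
```

Because the fixed part is a pure function of the token, it needs no stored parameters and no pretraining; only the trainable slice carries gradients in this sketch.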

Abstract

High-energy materials (HEMs) are critical for propulsion and defense domains, yet their discovery remains constrained by limited experimental data and restricted access to testing facilities. This work presents a novel approach to designing high-energy molecules that combines Long Short-Term Memory (LSTM) networks for molecular generation with attentive graph neural networks (GNNs) for property prediction. We propose a transformative embedding space construction strategy that integrates fixed SHA-256 embeddings with partially trainable representations. Unlike conventional regularization techniques, this changes the representational basis itself, reshaping the molecular input space before learning begins. Without recourse to pretraining, the generator achieves 67.5% validity and 37.5% novelty. The generated library exhibits a mean Tanimoto coefficient of 0.214 relative to the training set, indicating that the framework can generate a diverse chemical space. We identified 37 new candidate super explosives with predicted detonation velocity above 9 km/s.
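
The reported validity, novelty, and Tanimoto figures can be illustrated with a small sketch. It rests on several assumptions that the abstract does not confirm: validity and novelty are both taken as fractions of all generated strings, fingerprints are modeled as plain sets of "on" bit indices, and the mean Tanimoto coefficient is averaged over all generated/training pairs. The `is_valid` callable is a hypothetical stand-in for a real SMILES parser, and a real pipeline would compare canonicalized SMILES when checking novelty.

```python
def tanimoto(a, b):
    """Tanimoto coefficient |a & b| / |a | b| for two sets of on-bits."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def generation_metrics(generated, training, is_valid):
    """Validity and novelty as fractions of all generated strings.

    Assumption: the denominator is the full generated sample; other
    conventions divide novelty by the number of valid molecules.
    """
    valid = [s for s in generated if is_valid(s)]
    novel = set(valid) - set(training)
    return {
        "validity": len(valid) / len(generated),
        "novelty": len(novel) / len(generated),
    }


def mean_similarity_to_training(gen_fps, train_fps):
    """Average Tanimoto coefficient over all generated/training pairs."""
    total = sum(tanimoto(g, t) for g in gen_fps for t in train_fps)
    return total / (len(gen_fps) * len(train_fps))


if __name__ == "__main__":
    # Toy validity check (balanced parentheses), NOT a chemistry check.
    balanced = lambda s: s.count("(") == s.count(")")
    print(generation_metrics(["CCO", "C(", "CCO", "CN"], ["CN"], balanced))
    print(mean_similarity_to_training([{1, 2, 3}], [{2, 3, 4}, {1, 2, 3}]))
```

A low mean similarity to the training set, such as the 0.214 quoted above, indicates that generated molecules share few fingerprint bits with the training molecules on average.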

Paper Structure

This paper contains 27 sections, 9 theorems, 35 equations, 16 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

Let $\mathcal{H}_{d_t} = \{f_\theta(\cdot; E_t, E_f): \theta \in \Theta, E_t \in \mathcal{E}_t\}$ denote the hypothesis class for embedding dimension $d_t$, with parameter spaces constrained by Assumption assump:model. The empirical Rademacher complexity is: where $\sigma_i \sim \{\pm 1\}$ are independent Rademacher variables.
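
The displayed equation in this lemma did not survive on this page. For orientation only, the textbook definition of empirical Rademacher complexity is given here in assumed notation: the sample $S = \{x_1, \dots, x_n\}$, the sample size $n$, and the symbol $\hat{\mathfrak{R}}_S$ are not taken from the source, and the paper's own statement may differ (for example, by composing $f$ with a loss).

```latex
\hat{\mathfrak{R}}_S(\mathcal{H}_{d_t})
  = \mathbb{E}_{\sigma}\left[
      \sup_{f \in \mathcal{H}_{d_t}}
      \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, f(x_i)
    \right]
```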

Figures (16)

  • Figure 1: (a) Molecule structure and (b) Molecular weight histogram.
  • Figure 2: Comparison of X-X bond distribution (a) and X-X-X bond distribution (b) across the dataset.
  • Figure 3: Scatter plot of experimental impact sensitivity against detonation velocity across different energetic material categories, with $\log(h_{50}(\mathrm{obs}))$ (cm) on the x-axis and detonation velocity $D$ ($\mathrm{km\,s^{-1}}$) on the y-axis; 'obs' denotes observed values. Distinct colors and marker shapes denote functional categories of energetic groups. RDX and PETN are labeled for reference. The trend highlights the trade-off between higher detonation performance and reduced mechanical stability.
  • Figure 4: Schematic overview of the generative model architecture used for high-energy molecule generation. The molecular structure is first represented as a SMILES sequence, which is then tokenized and passed through an embedding layer. A dropout layer introduces regularization to mitigate overfitting in the low-data regime. The core LSTM architecture processes the sequential data, where each LSTM unit updates hidden states through the interaction of input, forget, and output gates as per standard LSTM dynamics. The processed latent representation is then passed through another dropout layer and a fully connected linear layer to decode the next character in the sequence. This lightweight model architecture facilitates rapid generation of valid, novel, and diverse high-energy molecules using minimal computational resources.
  • Figure 5: (a) Graph-based heuristic for feature extraction, and (b) schematic of a graph neural network with attention.
  • ...and 11 more figures
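
The Figure 4 caption refers to "standard LSTM dynamics" without stating them. For reference, the usual LSTM cell update is written out here in assumed notation: $x_t$ is the embedded token at step $t$, $h_t$ and $c_t$ are the hidden and cell states, $\sigma$ is the logistic sigmoid, $\odot$ is the elementwise product, and the weight names $W$, $U$, $b$ are illustrative rather than taken from the paper. The last line sketches the linear decoding layer over the token vocabulary, with dropout omitted.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t) \\
p(s_{t+1} \mid s_{1:t}) &= \operatorname{softmax}(W_{\mathrm{out}} h_t + b_{\mathrm{out}})
\end{aligned}
```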

Theorems & Definitions (18)

  • Lemma: Hypothesis Class
  • Lemma: Rademacher Complexity Bound
  • proof
  • Proposition: Generalization Bound
  • proof
  • Remark
  • Definition: Embedding Coherence
  • Lemma: Coherence Bound
  • proof
  • Proposition: Gram Matrix Conditioning
  • ...and 8 more