NLP for Knowledge Discovery and Information Extraction from Energetics Corpora

Francis G. VanGessel; Efrem Perry; Salil Mohan; Oliver M. Barham; Mark Cavolowsky

TL;DR

This work demonstrates the feasibility of applying natural language processing to the energetics domain by evaluating three unsupervised models (LDA, Word2Vec, Transformer) on a large corpus (~80k energetics texts) for knowledge discovery and information extraction. It additionally builds a document classification pipeline and compares domain-specific Transformer fine-tuning against other baselines, achieving up to 76% accuracy in abstract classification and strong domain-aligned masked-language modeling performance (65.8% accuracy for the Energetics fine-tuned model). Across methods, the study shows coherent topic formation, meaningful embedding structure, and practical utility for rapid retrieval and categorization of energetic literature. The results signify a step toward accelerated energetics research and material development, while highlighting the need for domain-specific annotated datasets and scalable alignment strategies for very large language models.

Abstract

We present a demonstration of the utility of NLP for aiding research into energetic materials and associated systems. NLP enables machine understanding of textual data, offering an automated route to knowledge discovery and information extraction from energetics text. We apply three established unsupervised NLP models (Latent Dirichlet Allocation, Word2Vec, and the Transformer) to a large curated dataset of energetics-related scientific articles. We demonstrate that each NLP algorithm is capable of identifying energetic topics and concepts, generating a language model that aligns with Subject Matter Expert knowledge. Furthermore, we present a document classification pipeline for energetics text. Our classification pipeline achieves 59-76% accuracy depending on the NLP model used, with the highest-performing Transformer model rivaling inter-annotator agreement metrics. The NLP approaches studied in this work can identify concepts germane to energetics and therefore hold promise as a tool for accelerating energetics research efforts and energetic material development.

Paper Structure (31 sections, 5 equations, 5 figures, 8 tables)


Introduction
Literature Review
ML Preliminaries & Model Overview
ML Preliminaries
ML
Model Overview
Latent Dirichlet Allocation
Word2Vec
Transformer
Random Forest
Methodology
Data Preparation
Data Curation
Text Preprocessing
Model Training, Assessment, and Interpretation
...and 16 more sections

Figures (5)

Figure 1: Graphical overview of the three unsupervised models used in this study. The LDA topic model (a) assumes that each document is a mixture of topics, and each topic is represented as a probability distribution over words present in the corpus. In the example we have three topics related to molecule synthesis, energetic formulation ingredients, and explosive performance. Each document is assigned one or more of these topics according to the thematic elements of the document. The word embedding model (b) seeks to build a predictive model for a center word in a sequence given the surrounding context. In this example, information from every word within the context is assimilated, via individual shallow neural networks, into the prediction of the center word, PBXN-111. The Transformer model (c) is trained in a self-supervised fashion to predict a masked word in a sequence. In this model, each word of a sequence is transformed into a query, key, and value vector, which are combined via the attention process into an attention score. The masked word prediction component then uses the attention score of every word in the sequence to predict the masked word. In contrast to the Word2Vec model, which equally weights each word within a finite-width context window, the Transformer considers all words within the sequence, attending more strongly to informative words (e.g. explosive, HMX, and metal) while assigning less weight to uninformative words.
Figure 2: Training and validation procedures for the LDA, Word2Vec, and Transformer models.
Figure 3: Word embeddings generated by the Word2Vec model. The t-SNE algorithm was used to reduce the embeddings to two dimensions to aid in visualization and analysis. A select number of embedding groups have been annotated to highlight clustering of energetic concepts. In the upper right corner, several words related to computational modeling of shock physics and high-impact events have been labeled in orange. Below the modeling cluster on the right-hand side, several words related to thermodynamic processes associated with detonation dynamics have been labeled in red. At the central bottom portion of the figure, molecular synthesis concepts and high-explosive molecule names have been highlighted in blue. In the upper left corner, phrases related to propellants have been identified with green labels. Finally, certain words with no apparent thematically coherent cluster have been labeled in black.
Figure 5: The cross entropy loss (calculated for training and validation data sets) and masked token prediction accuracy (calculated on only the validation data set) plotted with respect to number of epochs.
Figure 6: Document classification accuracy of energetic abstracts using various models as the featurization method. The dark red color indicates Transformer variant models. The black bar represents the standard deviation of the random forest classifier accuracy. All reported metrics are obtained from five-fold cross validation.
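
The contrast drawn in the Figure 1 caption, between equal weighting over a context window and attention over the whole sequence, can be illustrated with a minimal sketch. It assumes the standard scaled dot-product form of attention; the toy tokens, the two-dimensional query, key, and value vectors, and all function names are hypothetical and chosen only for illustration. In a trained Transformer these vectors come from learned projections, and nothing here reproduces the paper's models or data.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])


def weighted_sum(weights, values):
    """Combine value vectors into one context vector using the weights."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Hypothetical toy sequence; all vectors below are made up by hand.
tokens = ["the", "explosive", "HMX", "and"]
keys = [[0.1, 0.0], [1.0, 0.5], [0.9, 0.8], [0.0, 0.1]]
values = [[0.0, 0.1], [1.0, 0.0], [0.8, 0.6], [0.1, 0.0]]
query = [1.0, 1.0]  # query vector for the masked position (assumed)

# Attention: weights depend on how well each key matches the query.
attn = attention_weights(query, keys)

# Equal weighting: every word in the context window counts the same.
uniform = [1.0 / len(tokens)] * len(tokens)

attn_context = weighted_sum(attn, values)
uniform_context = weighted_sum(uniform, values)

for token, a, u in zip(tokens, attn, uniform):
    print(f"{token:>10}  attention={a:.3f}  uniform={u:.3f}")
```

With these hand-picked vectors, the tokens whose keys align with the query ("explosive" and "HMX") receive larger attention weights than "the" and "and", whereas the uniform scheme cannot distinguish informative from uninformative words.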
