NLP for Knowledge Discovery and Information Extraction from Energetics Corpora
Francis G. VanGessel, Efrem Perry, Salil Mohan, Oliver M. Barham, Mark Cavolowsky
TL;DR
This work demonstrates the feasibility of applying natural language processing to the energetics domain by evaluating three unsupervised models (LDA, Word2Vec, Transformer) on a large corpus (~80k energetics texts) for knowledge discovery and information extraction. It also builds a document classification pipeline and compares domain-specific Transformer fine-tuning against other baselines, achieving up to 76% accuracy in abstract classification and strong domain-aligned masked-language modeling performance (65.8% accuracy for the Energetics fine-tuned model). Across methods, the study shows coherent topic formation, meaningful embedding structure, and practical utility for rapid retrieval and categorization of energetics literature. The results mark a step toward accelerated energetics research and material development, while highlighting the need for domain-specific annotated datasets and scalable alignment strategies for very large language models.
Abstract
We demonstrate the utility of natural language processing (NLP) for aiding research into energetic materials and associated systems. NLP enables machine understanding of textual data, offering an automated route to knowledge discovery and information extraction from energetics text. We apply three established unsupervised NLP models (Latent Dirichlet Allocation, Word2Vec, and the Transformer) to a large curated dataset of energetics-related scientific articles. We show that each NLP algorithm is capable of identifying energetic topics and concepts, generating a language model that aligns with Subject Matter Expert knowledge. Furthermore, we present a document classification pipeline for energetics text. Our classification pipeline achieves 59-76% accuracy depending on the NLP model used, with the highest-performing Transformer model rivaling inter-annotator agreement metrics. The NLP approaches studied in this work can identify concepts germane to energetics and therefore hold promise as a tool for accelerating energetics research efforts and energetic material development.
