Automated Extraction of Acronym-Expansion Pairs from Scientific Papers
Izhar Ali, Million Haileyesus, Serhiy Hnatyshyn, Jan-Lucas Ott, Vasil Hnatyshin
TL;DR
The paper addresses the challenge of abundant and variable acronym usage in scientific literature by proposing a hybrid pipeline that combines document preprocessing, a regular-expression–based parser, and GPT-4 to disambiguate and expand acronyms. It demonstrates that neither regular expressions nor GPT-4 alone suffices; combining them with careful context handling yields higher accuracy in extracting acronym–expansion pairs, particularly across diverse domains. An evaluation on 200 arXiv papers shows that the integrated approach outperforms single-method baselines, and it also exposes remaining limitations such as acronyms that start with a lowercase letter and domain-specific variations. The work advances practical NLP capabilities for acronym disambiguation, with potential benefits for information retrieval, indexing, and downstream analytical tasks in scholarly corpora.
Abstract
This project addresses the challenges posed by the widespread use of abbreviations and acronyms in digital texts. We propose a novel method that combines document preprocessing, regular expressions, and a large language model to identify abbreviations and map them to their corresponding expansions. Regular expressions alone are often insufficient to extract expansions; in those cases our approach uses GPT-4 to analyze the text surrounding the acronym. By limiting the analysis to a small portion of the surrounding text, we mitigate the risk of obtaining incorrect or multiple expansions for an acronym. Processing text with acronyms poses several known challenges, including polysemous, non-local, and ambiguous acronyms. Our approach improves the precision and efficiency of NLP techniques by addressing these issues through automated acronym identification and disambiguation. This study also highlights the challenges of working with PDF files and the importance of document preprocessing. Furthermore, our results show that neither regular expressions nor GPT-4 performs well on its own. Regular expressions are suitable for identifying acronyms but are limited in finding expansions within a paper, both because authors use a variety of formats to express acronym-expansion pairs and because they often omit expansions from the text altogether. GPT-4, on the other hand, is an excellent tool for obtaining expansions but struggles to identify all relevant acronyms. In addition, because GPT-4 is probabilistic, it may return slightly different results for the same input. To provide the most accurate and consistent results, our algorithm uses preprocessing to eliminate irrelevant information from the text, regular expressions to identify acronyms, and a large language model to help find their expansions.
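The division of labor described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the acronym pattern, the "Long Form (ACR)" heuristic, the 200-character window, and every function name here are assumptions made for illustration. The language-model step is represented by a hypothetical callable, ask_llm, that receives an acronym and a short context string, so no specific GPT-4 API call is implied.

```python
import re

# Assumed pattern: a run of 2 to 10 uppercase letters/digits starting with a letter.
# The paper's actual expressions are not given in this section.
ACRONYM_RE = re.compile(r"\b[A-Z][A-Z0-9]{1,9}\b")


def find_acronyms(text):
    """Return unique acronym candidates in order of first appearance."""
    seen = []
    for match in ACRONYM_RE.finditer(text):
        if match.group() not in seen:
            seen.append(match.group())
    return seen


def regex_expansion(text, acronym):
    """Handle only the 'Long Form (ACR)' format: take the words just before
    the parenthesized acronym and accept them if their initials match."""
    for match in re.finditer(r"\(" + re.escape(acronym) + r"\)", text):
        words = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        candidate = words[-len(acronym):]
        if len(candidate) == len(acronym) and all(
            word[0].upper() == letter for word, letter in zip(candidate, acronym)
        ):
            return " ".join(candidate)
    return None


def context_window(text, acronym, width=200):
    """Return at most `width` characters on each side of the first occurrence."""
    match = re.search(r"\b" + re.escape(acronym) + r"\b", text)
    if match is None:
        return ""
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    return text[start:end]


def extract_pairs(text, ask_llm=None, width=200):
    """Regex first; fall back to the (hypothetical) ask_llm callable,
    passing it only a small window of surrounding text."""
    pairs = {}
    for acronym in find_acronyms(text):
        expansion = regex_expansion(text, acronym)
        if expansion is None and ask_llm is not None:
            expansion = ask_llm(acronym, context_window(text, acronym, width))
        pairs[acronym] = expansion
    return pairs
```

In this sketch the regular expressions resolve only the simplest format, which mirrors the abstract's point that they identify acronyms reliably but miss many expansions. Passing only the output of context_window to the model corresponds to restricting the analysis to a small portion of the surrounding text.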
