AutoIE: An Automated Framework for Information Extraction from Scientific Literature
Yangyang Liu, Shoubin Li
TL;DR
AutoIE addresses the challenge of extracting key information from vast scientific literature by proposing a three-unit framework for automated information extraction from PDFs: Layout and Location (MFFAPD and AFBRSC), Information Extraction (SBERT with transfer learning), and Display/Feedback (online learning via OLPTM). It demonstrates strong generalization on standard benchmarks (CoNLL04 and ADE) with Macro F1 scores of 87.19 and 89.65, respectively, and shows domain transfer to molecular sieve synthesis achieving 78% overall accuracy. The approach combines multi-dimensional features (span, width, CLS, POS) within a two-stage SBERT architecture to perform joint entity and relation extraction, enhanced by an annotation loop that continually refines training data. In a practical molecular sieve application, AutoIE delivers substantial speedups (over 3x) and scalable labeling, underscoring its potential to improve information management and discovery in specialized scientific domains; future work will explore incorporating large language models to further enhance performance.
Abstract
In the rapidly evolving field of scientific research, efficiently extracting key information from the burgeoning volume of scientific papers remains a formidable challenge. This paper introduces AutoIE, a framework designed to automate the extraction of vital data from scientific PDF documents, enabling researchers to discern future research trajectories more readily. AutoIE integrates four novel components: (1) a multi-semantic feature fusion-based approach for PDF document layout analysis; (2) advanced functional block recognition in scientific texts; (3) a synergistic technique for extracting and correlating information on molecular sieve synthesis; (4) an online learning paradigm tailored for molecular sieve literature. Our SBERT model achieves Macro F1 scores of 87.19 and 89.65 on the CoNLL04 and ADE datasets, respectively. In addition, a practical application of AutoIE in the petrochemical molecular sieve synthesis domain demonstrates its efficacy, reaching an accuracy of 78%. This research paves the way for enhanced data management and interpretation in molecular sieve synthesis, and it is a valuable asset for both seasoned experts and newcomers in this specialized field.
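For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Macro F1 score is commonly computed: the F1 of each class is calculated separately and the results are averaged with equal weight. It assumes a simple one-label-per-item setting; the abstract does not state the exact matching criteria used in the AutoIE evaluation (for example, strict versus relaxed entity boundaries), and the function name `macro_f1` and the toy labels are illustrative only.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for class g
        else:
            fp[p] += 1      # class p was predicted but is wrong
            fn[g] += 1      # class g was missed
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy example (labels are hypothetical, not taken from the paper's data)
gold = ["Work_For", "Live_In", "Live_In", "Kill"]
pred = ["Work_For", "Live_In", "Kill", "Kill"]
print(round(macro_f1(gold, pred), 4))
```

Because each class contributes equally, a rare relation type affects the score as much as a frequent one, which is why Macro F1 is often preferred over accuracy on imbalanced extraction benchmarks.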
