Chemical Reaction Extraction from Long Patent Documents
Aishwarya Jadhav, Ritam Dutt
TL;DR
This work tackles extracting reaction spans from long chemical patents to support a reaction-centric patent knowledge base (ChemPatKB). It extends prior baseline models with BERT-based embeddings, chemical-domain pretraining (ChemBERT), and [CHEM] tokens, framing extraction as paragraph-level IOB span tagging. Through in-domain and cross-domain evaluations on the ChEMU dataset, the study finds that BiLSTM-CRF decoders with fine-tuned BERT variants yield strong performance, while [CHEM] tokens improve generalization across domains. The results motivate future multi-task learning with Chemical NER and the creation of a standardized gold benchmark to enable robust cross-domain reaction extraction in patents.
Abstract
Searching through patent documents is crucial for chemical patent recommendation and retrieval. This search can be enhanced by creating a patent knowledge base (ChemPatKB) that aids prior art searches and gives domain experts a platform for exploring new innovations in chemical compound synthesis and use cases. An essential foundational component of this KB is the extraction of important reaction snippets from long patent documents, which facilitates multiple downstream tasks such as reaction co-reference resolution and chemical entity role identification. In this work, we explore the problem of extracting reaction spans from chemical patents in order to create a reaction resource database. We formulate this task as a paragraph-level sequence tagging problem, where the system is required to return the sequences of paragraphs that contain a description of a reaction. We propose several approaches and modifications of the baseline models and study how different methods generalize across different domains of chemical patents.
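
To make the paragraph-level tagging formulation concrete, the following minimal sketch shows how a sequence of per-paragraph IOB tags could be decoded into reaction spans. It is an illustration under stated assumptions, not the authors' implementation: the function name `iob_to_spans`, the bare tag strings "B", "I", and "O", and the lenient handling of a stray "I" tag are all hypothetical choices.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert per-paragraph IOB tags into (start, end) paragraph spans.

    Indices are 0-based and `end` is inclusive. Assumed tag set:
    "B" (first paragraph of a reaction), "I" (continuation), "O" (outside).
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index of the first paragraph of the currently open span
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new reaction begins; close any span that is still open.
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif tag == "I":
            # Assumption: a stray "I" with no preceding "B" opens a span.
            if start is None:
                start = i
        else:
            # "O" (or any other tag) closes the open span, if there is one.
            if start is not None:
                spans.append((start, i - 1))
                start = None
    # Close a span that runs to the end of the document.
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


# Seven paragraphs containing two reaction descriptions.
example_tags = ["O", "B", "I", "I", "O", "B", "O"]
print(iob_to_spans(example_tags))  # [(1, 3), (5, 5)]
```

Under this scheme a tagger only has to emit one label per paragraph, and adjacent reactions stay separable because each new "B" tag closes the previous span.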
