
Chemical Reaction Extraction from Long Patent Documents

Aishwarya Jadhav, Ritam Dutt

TL;DR

This work tackles the extraction of reaction spans from long chemical patents to support a reaction-centric patent knowledge base (ChemPatKB). It extends prior baseline models with BERT-based embeddings, chemical-domain pretraining (ChemBERT), and [CHEM] tokens, framing extraction as paragraph-level IOB span tagging. In in-domain and cross-domain evaluations on the ChEMU dataset, the study finds that BiLSTM-CRF decoders with fine-tuned BERT variants perform strongly, while [CHEM] tokens improve generalization across domains. The results motivate future multi-task learning with chemical NER and the creation of a standardized gold benchmark for robust cross-domain reaction extraction in patents.

Abstract

The task of searching through patent documents is crucial for chemical patent recommendation and retrieval. This can be enhanced by creating a patent knowledge base (ChemPatKB) to aid in prior art searches and to provide a platform for domain experts to explore new innovations in chemical compound synthesis and use cases. An essential foundational component of this KB is the extraction of important reaction snippets from long patent documents, which facilitates multiple downstream tasks such as reaction co-reference resolution and chemical entity role identification. In this work, we explore the problem of extracting reaction spans from chemical patents in order to create a reaction resource database. We formulate this task as a paragraph-level sequence tagging problem, where the system is required to return the sequences of paragraphs that contain a description of a reaction. We propose several approaches and modifications of the baseline models and study how different methods generalize across different domains of chemical patents.
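
To make the paragraph-level tagging formulation concrete, the following is a minimal sketch of how reaction spans might be encoded as one IOB tag per paragraph and decoded back into spans. The helper names (`spans_to_iob`, `iob_to_spans`), the inclusive span convention, and the lenient handling of a stray "I" tag are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Optional, Tuple

# Assumed convention: (start, end) paragraph indices, end inclusive.
Span = Tuple[int, int]


def spans_to_iob(num_paragraphs: int, spans: List[Span]) -> List[str]:
    """Label each paragraph B, I, or O given non-overlapping reaction spans."""
    tags = ["O"] * num_paragraphs
    for start, end in spans:
        tags[start] = "B"  # first paragraph of a reaction description
        for i in range(start + 1, end + 1):
            tags[i] = "I"  # continuation paragraphs of the same reaction
    return tags


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Recover reaction spans from a sequence of paragraph-level tags."""
    spans: List[Span] = []
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new span begins here. A stray "I" with no open span is
            # treated as a span start (a lenient decoding assumption).
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


# Hypothetical patent with 7 paragraphs and two adjacent reaction descriptions.
tags = spans_to_iob(7, [(1, 3), (4, 4)])
print(tags)                # ['O', 'B', 'I', 'I', 'B', 'O', 'O']
print(iob_to_spans(tags))  # [(1, 3), (4, 4)]
```

The "B" tag is what lets two back-to-back reaction descriptions be separated; with plain inside/outside labels, paragraphs 1 through 4 above would collapse into a single span.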

Paper Structure (19 sections, 1 equation, 1 figure, 5 tables)

This paper contains 19 sections, 1 equation, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: Baseline architecture. The left panel illustrates the overall architecture of the model, while the right panel details the three decoder components.