ClaimCompare: A Data Pipeline for Evaluation of Novelty Destroying Patent Pairs
Arav Parikh, Shiri Dori-Hacohen
TL;DR
ClaimCompare is a data pipeline that generates labeled patent-claim datasets for evaluating novelty-destroying prior art. Drawing on the USPTO Bulk Data and Office Action APIs together with Google Patents, it constructs domain-focused datasets that pair each base patent with novelty-destroying patents and related non-novelty-destroying patents. Fine-tuning transformer models on the resulting data yields substantial improvements in ranking and detection metrics, supporting the utility of the pipeline. The work offers a US-centric, public dataset and methodology to accelerate novelty determinations, with potential for expansion to other domains and integration with advanced language models. The dataset and code are released so that researchers and practitioners can replicate and extend the work.
Abstract
A fundamental step in the patent application process is determining whether there exist prior patents that are novelty destroying. This step is routinely performed by both applicants and examiners in order to assess the novelty of proposed inventions among the millions of applications filed annually. However, conducting this search is time- and labor-intensive, as searchers must navigate complex legal and technical jargon while covering a large number of legal claims. Automated approaches that use information retrieval (IR) and machine learning (ML) to detect novelty-destroying patents are a promising way to streamline this process, yet research in this space remains limited. In this paper, we introduce a novel data pipeline, ClaimCompare, designed to generate labeled patent claim datasets suitable for training IR and ML models to address the challenge of novelty destruction assessment. To the best of our knowledge, ClaimCompare is the first pipeline that can generate multiple novelty-destroying patent datasets. To illustrate the practical relevance of this pipeline, we use it to construct a sample dataset comprising over 27K patents in the electrochemical domain: 1,045 base patents from the USPTO, each associated with 25 related patents labeled according to whether they destroy the novelty of the base patent. Subsequently, we conduct preliminary experiments showcasing the efficacy of this dataset in fine-tuning transformer models to identify novelty-destroying patents, demonstrating absolute improvements of 29.2% and 32.7% in MRR and P@1, respectively.
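For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how MRR (mean reciprocal rank) and P@1 (precision at rank 1) are conventionally computed over per-base-patent rankings. It uses the standard textbook definitions and is not the authors' evaluation code; the function names, the `toy` data, and the binary label convention (1 = novelty destroying, 0 = not) are assumptions made for illustration. An "absolute improvement" is the plain difference between two such scores, expressed in percentage points.

```python
from typing import Dict, List


def reciprocal_rank(ranked_labels: List[int]) -> float:
    """Return 1/rank of the first candidate labeled 1, or 0.0 if there is none."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def precision_at_1(ranked_labels: List[int]) -> float:
    """Return 1.0 if the top-ranked candidate is labeled 1, else 0.0."""
    return 1.0 if ranked_labels and ranked_labels[0] == 1 else 0.0


def evaluate(rankings: Dict[str, List[int]]) -> Dict[str, float]:
    """Average both metrics over all base patents (the dictionary keys)."""
    n = len(rankings)
    if n == 0:
        return {"MRR": 0.0, "P@1": 0.0}
    mrr = sum(reciprocal_rank(r) for r in rankings.values()) / n
    p_at_1 = sum(precision_at_1(r) for r in rankings.values()) / n
    return {"MRR": mrr, "P@1": p_at_1}


# Hypothetical toy data: each list holds the labels of the candidate patents
# in the order a model ranked them for one base patent (1 = novelty destroying).
toy = {
    "base_A": [1, 0, 0, 0],  # relevant patent ranked first
    "base_B": [0, 1, 0, 0],  # relevant patent ranked second
    "base_C": [0, 0, 0, 1],  # relevant patent ranked fourth
}
scores = evaluate(toy)
print(scores)
```

In this toy case the reciprocal ranks are 1, 1/2, and 1/4, so MRR is their mean (about 0.583), and only one of the three base patents has a relevant candidate in the top position, so P@1 is 1/3.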
