ClaimCompare: A Data Pipeline for Evaluation of Novelty Destroying Patent Pairs

Arav Parikh, Shiri Dori-Hacohen

TL;DR

ClaimCompare is a data pipeline that generates labeled patent-claim datasets for evaluating novelty-destroying prior art. Using the USPTO Bulk Data and Office Action APIs alongside Google Patents, it builds domain-focused datasets that pair each base patent with both novelty-destroying patents and related patents that are not novelty-destroying. Fine-tuning transformer models on the resulting data yields absolute improvements of 29.2% in MRR and 32.7% in P@1, supporting the utility of the pipeline. The work offers a US-centric, public dataset and methodology to accelerate novelty determinations, with potential for expansion to other domains and integration with advanced language models. The dataset and code are released so that researchers and practitioners can replicate and extend the work.

Abstract

A fundamental step in the patent application process is determining whether any prior patents are novelty destroying. Both applicants and examiners routinely perform this step to assess the novelty of proposed inventions among the millions of applications filed annually. However, the search is time- and labor-intensive, as searchers must navigate complex legal and technical jargon while covering a large number of legal claims. Automated approaches that use information retrieval (IR) and machine learning (ML) to detect novelty-destroying patents are a promising way to streamline this process, yet research in this space remains limited. In this paper, we introduce a novel data pipeline, ClaimCompare, designed to generate labeled patent claim datasets suitable for training IR and ML models to address the challenge of novelty destruction assessment. To the best of our knowledge, ClaimCompare is the first pipeline that can generate multiple novelty-destroying patent datasets. To illustrate the practical relevance of this pipeline, we use it to construct a sample dataset comprising over 27K patents in the electrochemical domain: 1,045 base patents from the USPTO, each associated with 25 related patents labeled according to whether they destroy the novelty of the base patent. We then conduct preliminary experiments showing the efficacy of this dataset in fine-tuning transformer models to identify novelty-destroying patents, demonstrating absolute improvements of 29.2% in MRR and 32.7% in P@1.
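For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how MRR (mean reciprocal rank) and P@1 (precision at rank 1) are commonly computed for a ranking task of this kind. It is not the authors' evaluation code. The function names and the toy rankings are assumptions made for illustration; each inner list stands for one base patent's candidates in model-ranked order, with 1 marking a novelty-destroying patent and 0 a related patent that is not novelty destroying.

```python
from typing import Sequence


def reciprocal_rank(labels_in_ranked_order: Sequence[int]) -> float:
    """Return 1/rank of the first relevant item (label 1), or 0.0 if none exists."""
    for rank, label in enumerate(labels_in_ranked_order, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: Sequence[Sequence[int]]) -> float:
    """Average the reciprocal rank over all queries (here: base patents)."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def precision_at_1(queries: Sequence[Sequence[int]]) -> float:
    """Fraction of queries whose top-ranked candidate is relevant."""
    return sum(1 for q in queries if q and q[0] == 1) / len(queries)


# Hypothetical toy data: three base patents, four ranked candidates each.
toy_rankings = [
    [1, 0, 0, 0],  # novelty-destroying patent ranked 1st
    [0, 1, 0, 0],  # ranked 2nd
    [0, 0, 0, 1],  # ranked 4th
]

mrr = mean_reciprocal_rank(toy_rankings)  # (1 + 1/2 + 1/4) / 3
p_at_1 = precision_at_1(toy_rankings)     # 1 of 3 queries has a hit at rank 1
```

An absolute improvement, as reported in the abstract, is the difference between two such scores expressed in percentage points, not a relative change.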

Paper Structure

This paper contains 12 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: The ClaimCompare Pipeline. ClaimCompare can accept any initial seed queries and generate a novelty-destroying dataset for that query set. For each base patent, the pipeline finds $k$ novelty-destroying and related, non-novelty-destroying patents. The pipeline makes two types of API calls to the USPTO (Bulk Data and Office Action APIs) and scrapes data from Google Patents to account for patent number changes.
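The shape of the data described in the caption can be sketched as a simple record type: one base patent with a list of related candidates, each carrying a binary novelty-destruction label. This is an illustrative assumption, not the released dataset schema. All class names, field names, and patent identifiers below are hypothetical. The `total_patents` helper only reproduces the arithmetic behind the "over 27K" figure in the abstract, and it assumes no patent is counted twice across base patents, which the source does not state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CandidatePatent:
    """A patent related to a base patent (hypothetical field names)."""
    patent_number: str
    claims_text: str
    novelty_destroying: bool  # True if it destroys the base patent's novelty


@dataclass
class BasePatentRecord:
    """One base patent together with its labeled related patents."""
    patent_number: str
    claims_text: str
    candidates: List[CandidatePatent] = field(default_factory=list)

    def positives(self) -> List[CandidatePatent]:
        return [c for c in self.candidates if c.novelty_destroying]

    def negatives(self) -> List[CandidatePatent]:
        return [c for c in self.candidates if not c.novelty_destroying]


def total_patents(n_base: int, related_per_base: int) -> int:
    """Upper bound on dataset size, assuming no patent appears twice."""
    return n_base * (1 + related_per_base)


# Placeholder identifiers and claim text, for illustration only.
record = BasePatentRecord(
    patent_number="BASE-0001",
    claims_text="1. An electrochemical cell comprising ...",
    candidates=[
        CandidatePatent("CAND-0001", "1. A cell comprising ...", True),
        CandidatePatent("CAND-0002", "1. A method of coating ...", False),
        CandidatePatent("CAND-0003", "1. An electrode assembly ...", False),
    ],
)

# 1,045 base patents with 25 related patents each, as stated in the abstract.
dataset_size = total_patents(1045, 25)
```

Under the no-overlap assumption, 1,045 × (1 + 25) = 27,170, which is consistent with the abstract's "over 27K patents".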