PaECTER: Patent-level Representation Learning using Citation-informed Transformers

Mainak Ghosh; Michael E. Rose; Sebastian Erhardt; Erik Buunk; Dietmar Harhoff

TL;DR

PaECTER addresses patent-level semantic similarity by combining domain-specific vocabulary with a citation-informed contrastive learning objective built on BERT for Patents. It trains with a triplet margin loss $\max\{\lVert V_F - V_P \rVert_2 - \lVert V_F - V_N \rVert_2 + m,\ 0\}$, where $V_F$, $V_P$, and $V_N$ are the embeddings of a focal patent, a positive, and a negative, and $m$ is the margin, on 300k focal patents and their associated positives/negatives. The resulting document embeddings surpass both patent-specific and general-purpose embeddings on rank-based metrics. Across multiple baselines and datasets, PaECTER achieves superior Rank First Relevant, MAP, and MRR@10, and it outperforms the public SEARCHFORMER on a shared test subset, indicating more efficient prior-art retrieval. The model and training data are publicly available on HuggingFace, enabling downstream patent analytics tasks such as classification, knowledge-flow tracing, and semantic prior-art search.
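
The triplet margin loss quoted above can be illustrated with a minimal sketch in plain Python. The function name, the toy two-dimensional vectors, and the margin value of 1.0 are assumptions made for illustration only; they are not taken from the paper, whose actual training operates on transformer embeddings.

```python
import math


def triplet_margin_loss(v_f, v_p, v_n, margin=1.0):
    """max(||v_f - v_p||_2 - ||v_f - v_n||_2 + margin, 0) for one triplet.

    v_f, v_p, v_n: embeddings of the focal, positive, and negative patent.
    margin: hypothetical value chosen for this example, not the paper's setting.
    """
    d_pos = math.dist(v_f, v_p)  # Euclidean distance focal -> positive
    d_neg = math.dist(v_f, v_n)  # Euclidean distance focal -> negative
    return max(d_pos - d_neg + margin, 0.0)


# Toy embeddings (illustrative only).
focal = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the focal patent
far_negative = [6.0, 8.0]    # distance 10: already separated by more than the margin
near_negative = [0.0, 1.0]   # distance 1: closer to the focal patent than the positive

loss_far = triplet_margin_loss(focal, positive, far_negative)
loss_near = triplet_margin_loss(focal, positive, near_negative)
```

The loss is zero once the negative lies at least `margin` farther from the focal patent than the positive does, so only triplets that violate this separation contribute to training.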

Abstract

PaECTER is an open-source document-level encoder built specifically for patents. We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations of patent documents. PaECTER performs better on similarity tasks than current state-of-the-art models used in the patent domain. More specifically, our model outperforms the patent-specific pre-trained language model (BERT for Patents) and general-purpose text embedding models (e.g., E5, GTE, and BGE) on our patent citation prediction test dataset across several rank evaluation metrics. On average, PaECTER places at least one of the most similar patents at rank 1.32 when compared against 25 irrelevant patents. Numerical representations generated by PaECTER from patent text can be used for downstream tasks such as classification, tracing knowledge flows, or semantic similarity search. Semantic similarity search is especially relevant to prior art search, for both inventors and patent examiners.
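
To make the rank-based metrics concrete, here is a minimal sketch of how Rank First Relevant, average precision (the per-query term behind MAP), and reciprocal rank at 10 (the per-query term behind MRR@10) could be computed for one query. It uses the standard information-retrieval definitions; the paper's exact evaluation protocol may differ, and the function names and the toy relevance list are assumptions for illustration.

```python
def rank_first_relevant(relevance):
    """1-based rank of the first relevant item, or None if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return rank
    return None


def average_precision(relevance):
    """Mean of precision@rank taken at the rank of each relevant item."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def reciprocal_rank_at_k(relevance, k=10):
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking for one focal patent: candidates sorted by
# embedding similarity, True where the candidate is actually relevant.
ranked = [False, True, False, True, False]

rfr = rank_first_relevant(ranked)    # first relevant item is at rank 2
ap = average_precision(ranked)       # (1/2 + 2/4) / 2
rr10 = reciprocal_rank_at_k(ranked)  # 1 / 2
```

Averaging these per-query values over all focal patents in a test set gives the dataset-level figures; an average Rank First Relevant close to 1 means a relevant patent is usually ranked first.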

Paper Structure (18 sections, 2 equations, 3 figures, 4 tables)

Introduction
Training Data
Objective
Focal Patents
Positive Citations
Negative Citations
Test Dataset
Training
Evaluation
Performance Evaluation
Comparison with SEARCHFORMER
External Evaluation
Ablation method
Conclusion
Limitations
...and 3 more sections

Figures (3)

Figure 1: Relationship of PaECTER model to existing models
Figure 2: Positive, easy and hard negatives in patent training selection
Figure 3: ECDF for the distribution of RFR scores across different models
