PaECTER: Patent-level Representation Learning using Citation-informed Transformers
Mainak Ghosh, Michael E. Rose, Sebastian Erhardt, Erik Buunk, Dietmar Harhoff
TL;DR
PaECTER addresses patent-level semantic similarity by combining domain-specific vocabulary with a citation-informed contrastive learning objective built on BERT for Patents. It is trained with a triplet margin loss $\max\{\lVert V_F - V_P \rVert_2 - \lVert V_F - V_N \rVert_2 + m,\ 0\}$, where $V_F$, $V_P$, and $V_N$ are the embeddings of the focal, positive, and negative patents and $m$ is the margin, on 300k focal patents and their associated positives and negatives. The resulting document embeddings surpass both specialized patent embeddings and general-purpose embeddings on rank-based metrics. Across multiple baselines and datasets, PaECTER achieves better Rank First Relevant, MAP, and MRR@10 scores, and it outperforms the public SEARCHFORMER on a shared test subset, indicating more efficient prior-art retrieval. The model and training data are publicly available on HuggingFace, enabling downstream patent analytics tasks such as classification, knowledge-flow tracing, and semantic prior-art search.
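To make the triplet margin loss concrete, the following is a minimal sketch in plain Python, not the authors' training code. The function names, the toy two-dimensional vectors, and the default margin of 1.0 are illustrative assumptions; the summary above does not state the margin value used for PaECTER.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(focal, positive, negative, margin=1.0):
    """max(||V_F - V_P||_2 - ||V_F - V_N||_2 + m, 0) for a single triplet.

    The default margin is an assumption made for illustration only.
    """
    gap = l2_distance(focal, positive) - l2_distance(focal, negative)
    return max(gap + margin, 0.0)


# Toy embeddings (hypothetical values, not real patent embeddings).
focal = [0.0, 0.0]
positive = [3.0, 4.0]       # distance 5 from the focal vector
far_negative = [6.0, 8.0]   # distance 10 from the focal vector
near_negative = [0.0, 1.0]  # distance 1 from the focal vector

print(triplet_margin_loss(focal, positive, far_negative))   # 0.0
print(triplet_margin_loss(focal, positive, near_negative))  # 5.0
```

The loss is zero once the negative is farther from the focal patent than the positive by at least the margin, so only triplets that violate this ordering contribute to the gradient.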
Abstract
PaECTER is an open-source document-level encoder designed specifically for patents. We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations of patent documents. PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain. More specifically, our model outperforms the patent-specific pre-trained language model (BERT for Patents) and general-purpose text embedding models (e.g., E5, GTE, and BGE) on our patent citation prediction test dataset across several rank evaluation metrics. On average, PaECTER places the first relevant patent at rank 1.32 when candidates are ranked against 25 irrelevant patents. Numerical representations generated by PaECTER from patent text can be used for downstream tasks such as classification, tracing knowledge flows, or semantic similarity search. Semantic similarity search is especially relevant to prior art search for both inventors and patent examiners.
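The average-rank figure reported above can be read as a "rank of the first relevant document" metric. The sketch below shows one way such a number could be computed, assuming each candidate already has a similarity score to the focal patent (higher means more similar). The function names and the toy scores are hypothetical, and the abstract does not specify how similarity is measured.

```python
def rank_first_relevant(scores, relevant):
    """Return the 1-based rank of the highest-scoring relevant candidate.

    scores:   similarity of each candidate to the focal patent (higher is closer)
    relevant: booleans marking which candidates are truly relevant (e.g. cited)
    """
    # Sort candidate indices by descending similarity score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            return rank
    raise ValueError("no relevant candidate in the pool")


def mean_rank_first_relevant(queries):
    """Average the rank of the first relevant candidate over (scores, relevant) pairs."""
    ranks = [rank_first_relevant(s, r) for s, r in queries]
    return sum(ranks) / len(ranks)


# Two toy queries with made-up scores (not results from the paper).
queries = [
    ([0.2, 0.9, 0.5, 0.7], [False, False, True, True]),  # first relevant at rank 2
    ([0.8, 0.1, 0.3], [True, False, False]),             # first relevant at rank 1
]
print(mean_rank_first_relevant(queries))  # 1.5
```

A value close to 1 means that a relevant patent is usually the top-ranked candidate, which is the behavior that matters in prior art search.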
