Patent Representation Learning via Self-supervision

You Zuo, Kim Gerdes, Eric Villemonte de La Clergerie, Benoît Sagot

TL;DR

The paper addresses patent document understanding without labeled data by identifying a patent-specific failure mode of dropout-based contrastive learning, namely over-dispersion of embeddings. It introduces section-based augmentation, using intra-document views from Title+Abstract and other sections such as Claims or Background to generate diverse positives, and combines this with dropout positives in a self-supervised framework. Empirical results on large-scale patent data show that section-based augmentation improves prior-art retrieval and IPC classification, achieving performance competitive with supervised baselines while maintaining better embedding geometry. The findings demonstrate that exploiting the discourse structure of patents yields embeddings that are both locally coherent and globally well distributed, suggesting scalable and robust patent representations. The work highlights the value of intra-document views for self-supervised patent understanding and provides reproducible code and evaluation resources.

Abstract

This paper presents a simple yet effective contrastive learning framework for learning patent embeddings by leveraging multiple views from within the same document. We first identify a patent-specific failure mode of SimCSE-style dropout augmentation: it produces overly uniform embeddings that lose semantic cohesion. To remedy this, we propose section-based augmentation, where different sections of a patent (e.g., abstract, claims, background) serve as complementary views. This design introduces natural semantic and structural diversity, mitigating over-dispersion and yielding embeddings that better preserve both global structure and local continuity. On large-scale benchmarks, our fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification, while avoiding reliance on brittle or incomplete annotations. Our analysis further shows that different sections specialize for different tasks (claims and summaries benefit retrieval, while background sections aid classification), highlighting the value of patents' inherent discourse structure for representation learning. These results highlight the value of exploiting intra-document views for scalable and generalizable patent understanding.
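
To make the idea of section-based positives concrete, the following is a minimal sketch of an in-batch contrastive (InfoNCE) loss in which the Title+Abstract embedding of each patent is paired with the embedding of another section of the same patent. It is an illustration under stated assumptions, not the authors' implementation: the function name `info_nce_loss`, the temperature of 0.05, and the random stand-in embeddings are all hypothetical, and the actual method also combines these pairs with dropout positives.

```python
import numpy as np

def info_nce_loss(anchor, positive, temperature=0.05):
    """In-batch InfoNCE: row i of `anchor` should match row i of `positive`.

    Hypothetical helper; the temperature value is an assumption.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature              # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the positive pairs; other columns act as negatives.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
# Stand-in embeddings: one row per patent, two views of the same documents.
ta_view = rng.normal(size=(8, 16))                      # Title+Abstract view
claims_view = ta_view + 0.1 * rng.normal(size=(8, 16))  # Claims view (assumed similar)
shuffled_view = claims_view[::-1].copy()                # deliberately misaligned pairs

aligned_loss = info_nce_loss(ta_view, claims_view)
misaligned_loss = info_nce_loss(ta_view, shuffled_view)
```

In this toy setup the loss is lower when each Title+Abstract row is paired with the Claims row of the same patent than when the pairs are misaligned, which is the signal the encoder would be trained on.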

Paper Structure

This paper contains 55 sections, 6 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: An example of a granted patent document (US Patent 9,914,063 B1). (Section structure and formatting may vary across jurisdictions.)
  • Figure 2: Training curves with dropout-only augmentation (input = TA). We report Precision@1 on IPC classification (KNN, $k{=}10$) in blue, and singular spectrum divergence (SSD) in red, evaluated every 250 steps. (The SSD is defined as the KL divergence between the normalized singular values of the embedding matrix and a uniform distribution.)
  • Figure 3: Embedding-space diagnostics. Each point shows alignment (x-axis) and uniformity (y-axis), dot size encodes normalized SSD, and color encodes prior-art retrieval performance (R@100, Abstract$\rightarrow$Abstract). Contours indicate an RBF-smoothed performance field within the convex hull of observed points.
  • Figure 4: Section distribution of top-100 retrieved documents in Claims$\rightarrow$All. Our section-augmented model retrieves a more balanced mix beyond claims, increasing the share of summary and background, which provide complementary discourse cues.
  • Figure 5: Embedding Space Diagnostics across three patent sections: Title+Abstract (TA), Claims, and Description. We report Alignment ↓, Uniformity ↓, and Singular Spectrum Divergence (SSD/log d) ↓. Lower values indicate better geometry.
  • ...and 1 more figure
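
The geometry diagnostics named in the captions above can be sketched as follows. The SSD function follows the definition given in the Figure 2 caption (KL divergence between the normalized singular values of the embedding matrix and a uniform distribution), with an optional division by log d as in Figure 5. The alignment and uniformity functions assume the standard definitions of Wang and Isola (2020), which the captions do not spell out and which may differ in detail from the paper. All function names are hypothetical, and d is taken here to be the number of singular values, which is also an assumption.

```python
import numpy as np

def singular_spectrum_divergence(embeddings, normalize=False):
    """KL(p || uniform), where p is the normalized singular-value spectrum."""
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / s.sum()
    d = p.shape[0]                 # assumed: d = number of singular values
    nz = p > 0                     # 0 * log(0) is treated as 0
    kl = float(np.sum(p[nz] * np.log(p[nz] * d)))
    return kl / np.log(d) if normalize else kl

def alignment(x, y):
    """Mean squared distance between positive pairs (rows of x and y)."""
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))

def uniformity(x, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs of rows."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(x.shape[0], k=1)
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))

# Toy checks: an isotropic spectrum versus a collapsed (rank-one) one.
flat_ssd = singular_spectrum_divergence(np.eye(4))
collapsed_ssd = singular_spectrum_divergence(np.ones((4, 4)), normalize=True)

rng = np.random.default_rng(0)
points = rng.normal(size=(32, 8))
points /= np.linalg.norm(points, axis=1, keepdims=True)  # unit-norm embeddings
```

A perfectly flat spectrum gives an SSD of zero, while a rank-one embedding matrix gives a normalized SSD close to one. Under these definitions, lower values on all three metrics correspond to the "better geometry" reading stated in the Figure 5 caption.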