PatentGPT: A Large Language Model for Intellectual Property

Zilong Bai, Ruiji Zhang, Linqing Chen, Qijun Cai, Yuan Zhong, Cong Wang, Yan Fang, Jie Fang, Jing Sun, Weikuan Wang, Lizhi Zhou, Haoran Hua, Tian Qiu, Chaochao Wang, Cheng Sun, Jianping Lu, Yixin Wang, Yubin Xia, Meng Hu, Haowen Liu, Peng Xu, Licong Xu, Fu Bian, Xiaolong Gu, Lisha Zhang, Weilei Wang, Changyang Tu

TL;DR

PatentGPT presents a standardized, cost-conscious approach to pretraining and aligning large language models for the IP domain, leveraging open-source bases (LLaMA2, Mixtral) and a two-stage, multilingual pretraining regime on ~240B IP-focused tokens. Through SFT and RLHF, the authors tailor the models to IP tasks, and introduce PatentBench to evaluate patent drafting, classification, translation, and reasoning. Empirical results show PatentGPT variants, especially the SMoE-based PatentGPT-1.0-MoE, achieving competitive or superior performance to GPT-4 on IP benchmarks and exam-style challenges, while offering favorable long-context efficiency. The work highlights the practicality of domain-specific LLMs for IP tasks and suggests avenues for longer-context support and English data augmentation to further enhance performance and applicability in real-world IP workflows.

Abstract

In recent years, large language models (LLMs) have attracted significant attention due to their exceptional performance across a multitude of natural language processing tasks, and have been widely applied in various fields. However, applying large language models in the Intellectual Property (IP) domain is challenging because the field demands specialized knowledge, privacy protection, and the processing of extremely long texts. In this technical report, we present for the first time a low-cost, standardized procedure for training IP-oriented LLMs that meets the unique requirements of the IP domain. Using this standard process, we have trained the PatentGPT series of models based on open-source pretrained models. Evaluated on the open-source IP-oriented benchmark MOZIP, our domain-specific LLMs outperform GPT-4, indicating the effectiveness of the proposed training procedure and the expertise of the PatentGPT models in the IP domain. Remarkably, our model surpassed GPT-4 on the 2019 China Patent Agent Qualification Examination, scoring 65 and matching human expert levels. Additionally, the PatentGPT model that uses the SMoE architecture achieves performance comparable to that of GPT-4 in the IP domain and demonstrates a better cost-performance ratio on long-text tasks, potentially serving as an alternative to GPT-4 within the IP domain.

Paper Structure (16 sections, 2 equations, 9 figures, 9 tables)

This paper contains 16 sections, 2 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: The distribution of different categories of pretraining data for PatentGPT models.
  • Figure 2: The proportion of different types of data used in each pretraining stage compared to the total amount of the corresponding type of pretraining data.
  • Figure 3: The workflow for conducting SFT and RLHF on PatentGPT models.
  • Figure 4: Zero-shot performance of PatentGPT models on PatentBench: The left panel illustrates patent summarization, drafting and IP-oriented open question answering capabilities of PatentGPT models in comparison with GPT-3.5-turbo, as evaluated automatically by GPT-4. The right panel shows the classification, examination, translation, text correction, and reasoning abilities assessed based on metrics widely used in NLP.
  • Figure 5: The performance of all models on the 2019 China Patent Agent Qualification Examination and their corresponding PPA scores.
  • ...and 4 more figures