Intelligent System for Automated Molecular Patent Infringement Assessment

Yaorui Shi, Sihang Li, Taiyan Zhang, Xi Fang, Jiankun Wang, Zhiyuan Liu, Guojiang Zhao, Zhengdan Zhu, Zhifeng Gao, Renxin Zhong, Linfeng Zhang, Guolin Ke, Weinan E, Hengxing Cai, Xiang Wang

TL;DR

This work introduces PatentFinder, a multi-agent system that decomposes automated molecular patent infringement assessment into specialized subtasks handled by tool-enabled agents, addressing limitations of large language models in interpreting complex Markush structures. It combines two neural tools (MarkushMatcher and MarkushParser) with a benchmark dataset MolPatent-240 to demonstrate improved accuracy (notably a 13.8% F1 and 12% accuracy gain over baselines) and increased interpretability via autonomous infringement reports. The approach is validated through extensive experiments, including evaluations of the Markush tools and case studies that illustrate reduced hallucinations and clearer reasoning paths. The MolPatent-240 dataset and the toolchain enable robust, scalable patent-protection analysis integrated into AI-driven drug discovery, with potential applicability to other scientific workflows.

Abstract

Automated drug discovery offers significant potential for accelerating the development of novel therapeutics by substituting labor-intensive human workflows with machine-driven processes. However, molecules generated by artificial intelligence may unintentionally infringe on existing patents, posing legal and financial risks that impede the full automation of drug discovery pipelines. This paper introduces PatentFinder, a novel multi-agent and tool-enhanced intelligence system that can accurately and comprehensively evaluate small molecules for patent infringement. PatentFinder features five specialized agents that collaboratively analyze patent claims and molecular structures with heuristic and model-based tools, generating interpretable infringement reports. To support systematic evaluation, we curate MolPatent-240, a benchmark dataset tailored for patent infringement assessment algorithms. On this benchmark, PatentFinder outperforms baseline methods that rely solely on large language models or specialized chemical tools, achieving a 13.8% improvement in F1-score and a 12% increase in accuracy. Additionally, PatentFinder autonomously generates detailed and interpretable patent infringement reports, showcasing enhanced accuracy and improved interpretability. The high accuracy and interpretability of PatentFinder make it a valuable and reliable tool for automating patent infringement assessments, offering a practical solution for integrating patent protection analysis into the drug discovery pipeline.

Paper Structure

This paper contains 23 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of PatentFinder. PatentFinder is a multi-agent framework for autonomous molecular patent infringement assessment. a) Planner coordinates subtasks among agents and compiles a comprehensive infringement report. b) Sketch Extractor extracts the patent's core Markush structures and associated claim requirements. c) Substituents Matcher identifies and validates substituent groups in Markush expressions relative to the query molecule. d) Requirements Examinator assesses whether the query molecule meets the patent’s substituent group requirements. e) Fact Checker verifies agents’ outputs against original claims and corrects discrepancies for accuracy.
  • Figure 2: Case study on MolPatent-240. a) A random sample from MolPatent-240 in which the molecule is not protected by the patent. b, c, d) The prediction outputs of three baseline language models, all implemented with the Patent Text + Markush String paradigm. e) The inference process of PatentFinder, in which different agents solve the subtasks separately and their results are summarized into an infringement report. For clarity, some of the outputs are redacted.
  • Figure 3: Performance comparison between different substituent group extraction methods. a) Average Tanimoto similarity of different matching algorithms. b) Levenshtein distances of different matching algorithms. c) Exact match accuracy and chemical validity of different matching algorithms. d) The accuracy, validity, and training loss of MarkushMatcher at different training steps. e) Statistics of the substituent groups used in training MarkushMatcher, where groups with a single heavy atom (C, N, O) appear most frequently. f, g) Exchangeable groups and adjacent groups make the substituent group matching task unsolvable without additional textual description, which shows that context from the patent must be included. GT: Ground Truth; Alt: Alternative solution.
  • Figure 4: Illustration of the development of MarkushMatcher. a) We construct Markush matching data with a reverse data generation algorithm that attaches substituent groups onto Markush skeletons. b) During training, the MarkushMatcher model takes the Markush structure and the query molecule as input and is trained to predict the value of each substituent group defined in the Markush structure.
  • Figure 5: Case Studies of MarkushParser. a) Comparison between different Markush input modes: Markush string and Markush image. Both groups use Gemini as the backbone language model. b) Markush structure reconstruction results of MarkushParser.
  • ...and 4 more figures
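
The Figure 3 caption names three metrics for comparing predicted and ground-truth substituent groups: Tanimoto similarity, Levenshtein distance, and exact match accuracy. The sketch below shows one way such metrics could be computed. It is an illustration under assumptions, not the paper's evaluation code: the function names are hypothetical, Levenshtein distance and exact match are taken over raw strings (for example SMILES), and Tanimoto similarity is taken over sets of fingerprint bit indices whose derivation from a molecule is outside the sketch.

```python
from typing import Sequence, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two sets of 'on' fingerprint bits.

    Assumption: fingerprints are given as sets of bit indices; how they are
    computed from a molecule is not specified here.
    """
    union = len(bits_a | bits_b)
    if union == 0:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(bits_a & bits_b) / union


def exact_match_accuracy(predictions: Sequence[str],
                         references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)
```

Plain string comparison treats two different SMILES spellings of the same group as a mismatch, so a practical evaluation would likely canonicalize the strings with a cheminformatics toolkit before applying the string-based metrics.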