PatentVision: A multimodal method for drafting patent applications

Ruo Yang, Sai Krishna Reddy Mudhiganti, Manali Sharma

TL;DR

The paper addresses the challenge of drafting comprehensive patent specifications by integrating textual claims with visual diagrams through multimodal large vision-language models. It introduces PatentVision, which extends prior text-only approaches by embedding structured context from claims, figures, and component mappings, and enabling interactive user guidance. Empirical results on a CPC G06F-focused dataset show that PatentVision outperforms the text-only PatentFormer and benefits substantially from fine-tuning and higher image resolution. The work highlights the practical potential of LVLM-based multimodal drafting for more consistent, scalable IP documentation and lays groundwork for broader, specialized applications of multimodal AI in technical domains.

Abstract

Patent drafting is complex because it requires detailed technical descriptions, legal compliance, and visual elements. Although Large Vision-Language Models (LVLMs) show promise across various tasks, their application to automating patent writing remains underexplored. In this paper, we present PatentVision, a multimodal framework that integrates textual and visual inputs, such as patent claims and drawings, to generate complete patent specifications. Built on advanced LVLMs, PatentVision enhances accuracy by combining fine-tuned vision-language models with domain-specific training tailored to patents. Experiments reveal that it surpasses text-only methods, producing outputs with greater fidelity and closer alignment with human-written standards. Its incorporation of visual data allows it to better represent intricate design features and functional connections, leading to richer and more precise results. This study underscores the value of multimodal techniques in patent automation, providing a scalable tool to reduce manual workloads and improve consistency. PatentVision not only advances patent drafting but also lays the groundwork for broader use of LVLMs in specialized areas, potentially transforming intellectual property management and innovation processes.

Paper Structure

This paper contains 12 sections, 7 figures.

Figures (7)

  • Figure 1: PatentVision is a framework that generates high-quality patent specifications using multimodal inputs like images, patent claims, and optional figure descriptions. Specifically, PatentVision integrates three inputs: the image, enriched textual content derived from PatentFormer’s text processing pipeline, and an instruction prompt tailored for the base vision-language model. The vision-language model is fine-tuned on domain-specific patent data to learn and replicate the formal writing style typical of patent specifications, thereby assisting patent authors in drafting coherent and contextually appropriate descriptions.
  • Figure 2: PatentFormer performs text processing by taking as input the image $I$, the claim $C$, and the image description $B$. It outputs an enriched textual representation containing structured tokens such as <comp_name>, which are subsequently encoded using the tokenizer of the language model. These enriched tokens provide explicit semantic anchors that facilitate more accurate and context-aware specification generation.
  • Figure 3: Comparison between PatentVision with different base LVLMs and LoRA ranks and PatentFormer.
  • Figure 4: Performance of PatentVision with different base LVLMs compared to their pretrained versions.
  • Figure 5: Performance of PatentVision with Gemma 3 as base model trained with varying epochs and LoRA ranks.
  • ...and 2 more figures
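
The enriched textual representation mentioned in the Figure 2 caption can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the `FigureContext` class, the `build_enriched_text` function, and every tag other than `<comp_name>` (namely `<claim>`, `<fig_desc>`, and `<comp_num>`) are hypothetical names chosen for the example. The image itself is not handled here; in the described framework it is passed to the vision-language model separately.

```python
from dataclasses import dataclass, field


@dataclass
class FigureContext:
    """Text-side inputs for one figure (hypothetical container)."""
    claim: str                      # claim text C
    figure_description: str = ""    # optional brief description B
    # Mapping from reference numeral to component name, e.g. "102" -> "processor".
    components: dict = field(default_factory=dict)


def build_enriched_text(ctx: FigureContext) -> str:
    """Wrap each input in structured tags so the model sees explicit anchors.

    Only <comp_name> appears in the source; the other tags are assumptions.
    """
    parts = [f"<claim> {ctx.claim} </claim>"]
    if ctx.figure_description:
        parts.append(f"<fig_desc> {ctx.figure_description} </fig_desc>")
    # Sort by numeral so the output is deterministic.
    for numeral, name in sorted(ctx.components.items()):
        parts.append(
            f"<comp_name> {name} </comp_name> <comp_num> {numeral} </comp_num>"
        )
    return "\n".join(parts)


if __name__ == "__main__":
    ctx = FigureContext(
        claim="A system comprising a processor coupled to a memory.",
        figure_description="FIG. 1 is a block diagram of the system.",
        components={"104": "memory", "102": "processor"},
    )
    print(build_enriched_text(ctx))
```

The resulting string would then be combined with an instruction prompt and tokenized by the base model's tokenizer, as the captions describe; how the authors actually serialize these fields is not specified in this excerpt.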