PATopics: An automatic framework to extract useful information from pharmaceutical patents documents

Pablo Cecilio; Antônio Perreira; Juliana Santos Rosa Viegas; Washington Cunha; Felipe Viegas; Elisa Tuler; Fabiana Testa Moura de Carvalho Vicentini; Leonardo Rocha

TL;DR

PATopics tackles the labor-intensive task of mining pharmaceutical patents by automatically extracting semantically coherent topics from a large patent corpus and linking these topics to inventors, companies, and molecules. It combines CluWords-based data representation with Non-negative Matrix Factorization to produce topic-document and word-topic matrices, formalized as $A \approx H \times W$, where $A$ encodes textual representations, $H$ maps documents to topics, and $W$ maps words to topics. The framework then provides a web interface for rapid exploration, visualization, and cross-patent comparisons, enabling researchers, chemists, and firms to assess patent landscapes and portfolio opportunities. Validated on 4,832 patents spanning 809 molecules and 478 companies, PATopics demonstrates coherent topic structure, meaningful entity correlations, and practical case studies that highlight its potential to accelerate patent discovery and technology scouting in pharma.
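The factorization $A \approx H \times W$ can be illustrated with a minimal sketch. It assumes a tiny hand-made document-term count matrix in place of the CluWords representation that PATopics uses, and it uses the classic multiplicative-update rule for Non-negative Matrix Factorization, which may differ from the solver the authors used. The vocabulary, the matrix values, and the function names (`nmf`, `frobenius_error`) are hypothetical and chosen only for illustration.

```python
import random

def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = transpose(Y)
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt] for row in X]

def frobenius_error(A, H, W):
    """Frobenius norm of the residual A - H x W."""
    HW = matmul(H, W)
    return sum((a - b) ** 2 for ra, rb in zip(A, HW) for a, b in zip(ra, rb)) ** 0.5

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Factor A (docs x words) into H (docs x k) and W (k x words), all non-negative."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    # Strictly positive random start so the multiplicative updates stay positive.
    H = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    W = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # W <- W * (H^T A) / (H^T H W), element-wise
        Ht = transpose(H)
        num, den = matmul(Ht, A), matmul(matmul(Ht, H), W)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # H <- H * (A W^T) / (H W W^T), element-wise
        Wt = transpose(W)
        num, den = matmul(A, Wt), matmul(H, matmul(W, Wt))
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return H, W

# Hypothetical toy corpus: 4 "patents" over a 6-word vocabulary.
vocab = ["tablet", "coating", "release", "spray", "nasal", "dose"]
A = [
    [2, 1, 2, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 2],
]

H, W = nmf(A, k=2)
for t, row in enumerate(W):
    top = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)[:3]
    print(f"topic {t}:", [vocab[j] for j in top])
print("reconstruction error:", round(frobenius_error(A, H, W), 3))
```

Each row of `H` gives one document's weight on each topic, and each row of `W` gives one topic's weight on each word, matching the roles of $H$ and $W$ described above.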

Abstract

Pharmaceutical patents play an important role: they protect innovations from being copied, and they also drive researchers to innovate, create new products, and promote disruptive innovations focused on collective health. The study of patent management usually requires an exhaustive manual search, because patent documents are complex and contain many details regarding the claims and the methodology and results of the invention. To mitigate the manual search, we propose PATopics, a framework specially designed to extract relevant information from pharmaceutical patents. PATopics is composed of four building blocks that extract textual information from the patents, build relevant topics capable of summarizing the patents, correlate these topics with useful patent characteristics, and then summarize the information in a friendly web interface for end users. The general contributions of PATopics are its ability to centralize patents and to organize patents into groups based on their similarities. We extensively analyzed the framework using 4,832 pharmaceutical patents concerning 809 molecules patented by 478 companies. In our analysis, we evaluate the use of the framework considering the demands of three user profiles: researchers, chemists, and companies. We also designed four real-world use cases to evaluate the framework's applicability. Our analysis showed how practical and helpful PATopics is in the pharmaceutical scenario.
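The step that correlates topics with patent characteristics can be pictured with a small sketch. It assumes each patent already carries a vector of topic weights, assigns each patent to its highest-weight topic, and counts companies per topic. This is one plausible reading, not the procedure the authors describe; the records, company names, and function names below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records: a patent id, its assignee, and its topic weights.
patents = [
    {"id": "P1", "company": "Acme Pharma", "topic_weights": [0.9, 0.1]},
    {"id": "P2", "company": "Acme Pharma", "topic_weights": [0.7, 0.3]},
    {"id": "P3", "company": "Beta Labs", "topic_weights": [0.6, 0.4]},
    {"id": "P4", "company": "Beta Labs", "topic_weights": [0.2, 0.8]},
    {"id": "P5", "company": "Beta Labs", "topic_weights": [0.1, 0.9]},
    {"id": "P6", "company": "Gamma Bio", "topic_weights": [0.3, 0.7]},
]

def dominant_topic(weights):
    """Index of the topic with the largest weight for one patent."""
    return max(range(len(weights)), key=weights.__getitem__)

def companies_per_topic(records, top_n=5):
    """Map each topic to its top_n companies, ranked by number of patents."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[dominant_topic(rec["topic_weights"])][rec["company"]] += 1
    return {topic: c.most_common(top_n) for topic, c in counts.items()}

ranking = companies_per_topic(patents, top_n=5)
for topic in sorted(ranking):
    print(f"topic {topic}:", ranking[topic])
```

The same grouping could be keyed on molecules or inventors instead of companies by swapping the field that is counted.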

Paper Structure (20 sections, 5 figures, 5 tables)

Introduction
Materials and Methods
Framework construction
Data representation
Topic modeling decomposition
Correlation among entities
Summary interface
Data collecting and cleaning
Results and discussion
Framework overview
Topic analysis
Contributions
General contributions
Specific contributions
PATopics validation
...and 5 more sections

Figures (5)

Figure 1: Framework interface (A), which users access by logging in. After login, the homepage (B) presents an easily navigable interface showing the number of patents collected, the companies involved, the related molecules, and their respective inventors. The homepage also shows graphs of patents per year (2003 to 2021), patents per molecule, and patents per company. The right-hand panel includes the topic word cloud and the most recent patents.
Figure 2: The Topics section (A) comprises a search bar and lists the generated topics, each described by its word group, with an editable title and the number of patents per topic. The Companies section (B) has a search bar (by company) and shows companies distributed across the generated topics at 5, 10, 15, or 20 companies per topic. The Molecules section (C) highlights the molecules mentioned per topic.
Figure 3: Summary of insights and results regarding PATopics. Quantitative analysis of patents by (A) year, (B) company, and (C) molecule, and (D) topics covered by companies (%). (E) The main subjects in the collected pharmaceutical patents are formulations and compositions, new compounds and prodrugs, chronic conditions, pain, clinical methods, devices, viral and cancer-related, dermatological, gastrointestinal, gene therapy, brain disorders, and ophthalmic and nasal. (F) Detail of the main patented chronic conditions: hypertension, cardiovascular-related, diabetes, chronic pain, genetic diseases, and cholesterol- and triglyceride-related.
Figure 4: The profiles of potential users of the PATopics framework and the main interests the framework can serve. The first profile is researchers, who work with and study patents. The second is chemists, who develop the patents. The third is companies and industries, which buy or use patents.
Figure 5: Comparison between patents that include naloxone as the drug, showing the points in common and the differences between patents covering the same drug. The first patent relates to a liquid spray, the second to a sublingual and buccal film, and the third to a pharmaceutical preparation combining naloxone with another opioid.
