Crowdsourcing-Based Knowledge Graph Construction for Drug Side Effects Using Large Language Models with an Application on Semaglutide

Zhijie Duan, Kai Wei, Zhaoqian Xue, Jiayan Zhou, Shu Yang, Siyuan Ma, Jin Jin, Lingyao Li

TL;DR

This paper presents a scalable pipeline that leverages large language models to extract side-effect information from Reddit and construct a knowledge graph for semaglutide. The four-stage framework covers data collection, information extraction with prompt-based LLMs, KG construction with rich entity–relation metadata, and cross-source validation against FAERS using statistical tests. The resulting KG links four semaglutide-related medications to 1,775 side effects via 7,225 relations and 96 grouped terms, enabling both qualitative analyses and a quantitative cross-check with FDA data. The findings show general concordance with FAERS on common side effects while revealing Reddit-specific signals, including mental-health symptoms and rarer events, underscoring the value of patient-centered, crowdsourced real-world evidence for pharmacovigilance and highlighting the method’s generalizability to other drugs.

Abstract

Social media is a rich source of real-world data that captures valuable patient experience information for pharmacovigilance. However, mining data from unstructured and noisy social media content remains a challenging task. We present a systematic framework that leverages large language models (LLMs) to extract medication side effects from social media and organize them into a knowledge graph (KG). We apply this framework to semaglutide for weight loss using data from Reddit. Using the constructed knowledge graph, we perform comprehensive analyses to investigate reported side effects across different semaglutide brands over time. These findings are further validated through comparison with adverse events reported in the FAERS database, providing important patient-centered insights into semaglutide's side effects that complement its safety profile and current knowledge base of semaglutide for both healthcare professionals and patients. Our work demonstrates the feasibility of using LLMs to transform social media data into structured KGs for pharmacovigilance.
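
As a rough illustration of what "organizing extracted side effects into a knowledge graph" can look like in practice, the sketch below stores medication-to-side-effect relations together with the mentions that support them. It is a minimal sketch under stated assumptions, not the authors' implementation: the class names (`Mention`, `SideEffectKG`), the field names, and the toy records are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Mention:
    """One extracted mention of a side effect (field names are assumptions)."""
    medication: str
    side_effect: str
    severity: Optional[str] = None  # e.g. "mild" or "severe", if the post states it
    post_id: Optional[str] = None   # identifier of the source post, if kept


@dataclass
class SideEffectKG:
    """Medication and side-effect nodes joined by edges that carry mentions."""
    # (medication, side_effect) -> list of supporting mentions
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add(self, mention: Mention) -> None:
        self.edges[(mention.medication, mention.side_effect)].append(mention)

    def mention_count(self, medication: str, side_effect: str) -> int:
        return len(self.edges.get((medication, side_effect), []))

    def side_effects_of(self, medication: str) -> list:
        return sorted({se for (med, se) in self.edges if med == medication})


# Made-up toy records, for illustration only (not data from the paper).
kg = SideEffectKG()
kg.add(Mention("Rybelsus", "nausea", severity="mild", post_id="p1"))
kg.add(Mention("Rybelsus", "nausea", post_id="p2"))
kg.add(Mention("Rybelsus", "fatigue", post_id="p3"))

print(kg.side_effects_of("Rybelsus"))
print(kg.mention_count("Rybelsus", "nausea"))
```

Keeping the raw mentions on each edge, rather than only a count, is one way to support the kind of per-side-effect statistics and representative user quotes described for the interactive visualization.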

Paper Structure

This paper contains 12 sections, 9 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The conceptualized framework of the pipeline applied to semaglutide.
  • Figure 2: Interactive knowledge graph visualization of semaglutide side effect information extracted from Reddit data. A) The knowledge graph displaying relationships between four medication entities and their associated side effects, with node sizes proportional to the frequency of Reddit mentions. B) Focused view of the knowledge graph when selecting nausea as a single side effect, showing detailed statistics on severity, duration, and dosage, along with representative user experiences across different medication brands.
  • Figure 3: An overview of GPT-extracted Reddit post information on semaglutide. A) Number of Reddit mentions of semaglutide side effects by brand between January 1st, 2020, and January 31st, 2025. B) Top 10 GPT-extracted side effects mentioned for each brand of semaglutide. C) Overlapping top side effects extracted across brands. D) Top 15 GPT-extracted side effects weighted by GPT-extracted severity across brands. The point size indicates the number of mentions. Rybelsus is not displayed in the figure because none of the top 15 side effects with comments on severity were mentioned for Rybelsus.
  • Figure 4: Crowd-sourced side effect surveillance based on Reddit posts for semaglutide yields findings mostly consistent with FDA-registered AEs, with additional unique discoveries. A) Correspondence between the top 20 side effects identified by our Reddit-based knowledge graph and the top AEs registered by the FDA. B) Frequencies of the top events in the FDA data and in Reddit posts also largely agree (Spearman correlation of per-adverse-event frequencies = 0.423). C) Binomial logistic regression identifies events that Reddit crowdsourcing detected differentially compared with the FDA data (log odds ratio; statistical significance determined at p < 0.05 after Bonferroni correction across AEs). D) Brand-specific analysis of the differential pattern of Reddit crowdsourcing versus FDA surveillance.
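
The kind of comparison described in Figure 4C can be sketched with a simpler stand-in: a per-event log odds ratio between two sources with a Wald test and Bonferroni correction. This is not the paper's binomial logistic regression, although a logistic regression with a single binary "source" predictor yields the same log odds ratio as its coefficient. The function names, the 0.5 continuity correction, and the counts below are assumptions chosen for illustration.

```python
import math


def log_odds_ratio(a, n1, b, n2):
    """Log odds ratio of an event in source 1 (a of n1) vs source 2 (b of n2).

    Adds 0.5 to every cell (Haldane-Anscombe correction) so that zero
    counts do not cause division by zero. Returns (log OR, Wald SE).
    """
    c, d = n1 - a, n2 - b
    a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    lor = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return lor, se


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def differential_events(counts, n_reddit, n_fda, alpha=0.05):
    """counts maps event -> (reddit_count, fda_count).

    Returns event -> (log OR, Bonferroni-adjusted p, significant?).
    """
    m = len(counts)  # number of tests for the Bonferroni correction
    results = {}
    for event, (r, f) in counts.items():
        lor, se = log_odds_ratio(r, n_reddit, f, n_fda)
        p_adj = min(1.0, two_sided_p(lor / se) * m)
        results[event] = (lor, p_adj, p_adj < alpha)
    return results


# Made-up counts, for illustration only (not data from the paper).
toy_counts = {"nausea": (300, 300), "anxiety": (200, 20)}
results = differential_events(toy_counts, n_reddit=1000, n_fda=1000)
for event, (lor, p_adj, sig) in results.items():
    print(f"{event}: log OR = {lor:.2f}, adjusted p = {p_adj:.3g}, significant = {sig}")
```

A positive log odds ratio here means the event is reported relatively more often in the first source (Reddit in this toy setup); multiplying each p-value by the number of events tested is the Bonferroni adjustment.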