Focused digital cohort selection from social media using the metric backbone of biomedical knowledge graphs

Ziqi Guo, Jack Felag, Jordan C. Rozum, Rion Brattig Correia, Xuan Wang, Luis M. Rocha

TL;DR

The paper tackles the challenge of forming topic-focused digital cohorts from noisy social media by introducing a general, platform-agnostic method based on the metric backbone of biomedical knowledge graphs (KGs). It builds platform-specific KGs from a curated dictionary of biomedical terms, converts term co-occurrence into a distance space with $d_{ij} = 1/p_{ij} - 1$, and sparsifies the KG to a backbone that preserves all shortest-path relations. Users who contribute to the KG backbone (backbone contributors) form focused digital cohorts, with epilepsy-focused platforms yielding much higher backbone participation (≈93–95%) than general-purpose sites (≈65–72%), and backbone filtering reducing false positives compared with engagement-based methods. The approach reliably yields more biologically relevant cohorts, scales across platforms, and is generalizable to other conditions by updating the dictionary, offering a practical path to robust, interpretable social-media–driven biomedical inference. The method improves cohort relevance and reduces noise, enabling safer, scalable studies of treatment effects and patient experiences from online discourse. Key findings show substantial sparsification of KGs without loss of shortest-path information and superior discrimination of biomedical relevance versus misused terms. The work provides publicly available KGs and a scalable blueprint for future multi-platform health social-media research.
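The distance conversion and the backbone criterion described above can be illustrated with a minimal sketch. It assumes an undirected graph given as a dictionary of pairwise proximities $p_{ij} \in (0, 1]$ and keeps an edge only when its direct distance $d_{ij} = 1/p_{ij} - 1$ equals the shortest-path distance between its endpoints. The function names, the tolerance `tol`, and the three toy terms are illustrative assumptions, not the authors' implementation.

```python
import heapq
from collections import defaultdict


def proximity_to_distance(p):
    """Map a proximity p in (0, 1] to a distance d = 1/p - 1 in [0, inf)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("proximity must lie in (0, 1]")
    return 1.0 / p - 1.0


def shortest_distances(adjacency, source):
    """Dijkstra: shortest-path length from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, weight in adjacency[node].items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def metric_backbone(proximities, tol=1e-12):
    """Return the edges whose direct distance equals the shortest-path distance."""
    adjacency = defaultdict(dict)
    for (i, j), p in proximities.items():
        d = proximity_to_distance(p)
        adjacency[i][j] = d
        adjacency[j][i] = d
    backbone = set()
    for i in list(adjacency):
        dist = shortest_distances(adjacency, i)
        for j, d in adjacency[i].items():
            # An edge is redundant when some indirect path is strictly shorter.
            if d <= dist[j] + tol:
                backbone.add(frozenset((i, j)))
    return backbone


# Toy proximities (hypothetical values, chosen only for illustration).
toy = {
    ("seizure", "lamotrigine"): 0.5,   # d = 1
    ("lamotrigine", "rash"): 0.5,      # d = 1
    ("seizure", "rash"): 0.25,         # d = 3, longer than the indirect path (2)
}
backbone_edges = metric_backbone(toy)
```

In this toy graph the direct "seizure"–"rash" edge is dropped because the path through "lamotrigine" is shorter, while all shortest-path distances are unchanged. Running one Dijkstra search per node is a simple choice for small graphs; larger KGs would likely need a more efficient strategy.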

Abstract

Social media data allows researchers to construct large digital cohorts to study the interplay between human behavior and medical treatment. Identifying the users most relevant to a specific health problem is, however, a challenge in that social media sites vary in the generality of their discourse. To filter relevant users on any social media, we have developed a general method and tested it on epilepsy discourse. We analyzed the text from posts by users who mention epilepsy drugs at least once in the general-purpose social media sites X and Instagram, the epilepsy-focused Reddit subgroup (r/Epilepsy), and the Epilepsy Foundation of America (EFA) forums. We used a curated medical terminology dictionary to generate a knowledge graph (KG) from each social media site, whereby nodes represent terms, and edge weights denote the strength of association between pairs of terms in the collected text. Our method is based on computing the metric backbone of each KG, which yields the subgraph of edges that participate in shortest paths. By comparing the subset of users who contribute to the backbone to the subset who do not, we show that epilepsy-focused social media users contribute to the KG backbone in much higher proportion than do general-purpose social media users. Furthermore, using human annotation of Instagram posts, we demonstrate that users who do not contribute to the backbone are much more likely to use dictionary terms in a manner inconsistent with their biomedical meaning and are rightly excluded from the cohort of interest.
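The abstract's two user-facing steps, building edge weights from term co-occurrence and separating users who contribute to the backbone from those who do not, can be sketched as follows. This is a hedged illustration: the text here does not specify the association measure, so the sketch assumes a Jaccard-style proximity over posts, and it assumes that a user counts as a backbone contributor when at least one of their posts contains both terms of a backbone edge. The function names and the toy posts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def jaccard_proximity(posts):
    """Compute a co-occurrence proximity for every pair of terms.

    posts: iterable of (user, matched_terms) records, one per post.
    Assumption: proximity = posts with both terms / posts with either term.
    """
    term_count = Counter()
    pair_count = Counter()
    for _user, terms in posts:
        terms = set(terms)
        term_count.update(terms)
        pair_count.update(
            frozenset(pair) for pair in combinations(sorted(terms), 2)
        )
    proximity = {}
    for pair, both in pair_count.items():
        a, b = tuple(pair)
        proximity[pair] = both / (term_count[a] + term_count[b] - both)
    return proximity


def backbone_contributors(posts, backbone_edges):
    """Users with at least one post whose term pairs include a backbone edge."""
    contributors = set()
    for user, terms in posts:
        pairs = (
            frozenset(pair) for pair in combinations(sorted(set(terms)), 2)
        )
        if any(pair in backbone_edges for pair in pairs):
            contributors.add(user)
    return contributors


# Hypothetical posts: (user, dictionary terms matched in one post).
toy_posts = [
    ("u1", {"seizure", "lamotrigine"}),
    ("u2", {"seizure", "rash"}),
    ("u3", {"lamotrigine"}),
]
toy_proximity = jaccard_proximity(toy_posts)

# Assume a backbone computation kept only the seizure-lamotrigine edge.
toy_backbone = {frozenset(("seizure", "lamotrigine"))}
toy_contributors = backbone_contributors(toy_posts, toy_backbone)
```

Under these assumptions only `u1` is retained: `u2` contributes solely to a non-backbone edge, and `u3` never mentions two dictionary terms in the same post.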

Paper Structure

This paper contains 17 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Ego network reduction using the metric backbone of the Instagram KG. a: The full ego network for the term "Depression" (a condition highly comorbid with epilepsy), that is, the subnetwork of nodes directly connected to "Depression" in the Instagram KG. Rendered using a ForceAtlas2 layout (Jacomy et al., 2014). b: The metric backbone of the "Depression" ego network of the Instagram KG. c: A small subgraph of the full "Depression" ego network shown in a, with five dictionary terms. Distance edge weights are obtained from term co-occurrence in posts (see the knowledge network section for calculation details), with edge thickness rendered inversely proportional to distance. d: The metric backbone of the network in panel c; see text for additional details and examples.
  • Figure 2: Social media KGs and their metric backbones. For each data source, the depiction on the left is the whole network, and the one on the right is the metric backbone subgraph. The relative size of the metric backbone is shown as the percentage of edges kept (all nodes are kept in the metric backbone). Dictionary terms can be associated with one or more of four categories (drugs, medical terms, food/allergens, and other natural products), as described in the dictionary section, with the colors shown in the legend. These categories are used primarily for visualization purposes and do not impact our analysis. If a term belongs to more than one category, it is assigned to only one in the network visualizations, with the following preference order: drugs, medical terms, food/allergens, natural products. The nodes are sized according to their (unweighted) degree in the original network. The r/Epilepsy and EFA networks are computed only from their drug-mention subcohorts to better compare them with the X and Instagram networks. Node positions are determined by the Fruchterman-Reingold method applied to the backbone networks.
  • Figure 3: A backbone-based filtering example. First, we curate a medical dictionary, which includes terms related to drugs, allergens, medical terms, and natural products (see the dictionary section). Then, we collect posts from users on social media and match these posts with the dictionary terms (see the data collection section). The matched terms are represented by $m$ in the figure. Second, we build the KG, wherein the nodes represent the medical dictionary terms and the edge weights denote the likelihood that the connected pair of terms occur within the same post (see the knowledge network section). Third, from the KG we compute the metric backbone (see the backbone section). Finally, we identify the backbone contributors.
  • Figure 4: Proportions of users retained by backbone and engagement filters. High, medium, and low engagement refer to the filtering criteria of aggressive, lenient, and no filtering, respectively. Horizontal axes denote derived cohorts according to engagement filtering, and vertical axes denote derived cohorts according to backbone filtering. Each cell reports one subset of users. For example, in the top-left panel, 70.8% (4,219) of users in the X cohort are backbone contributors, 1.7% ($4,219 \times 1.7\% \approx 72$) of whom are also high-engagement. The raw count of each subset can be found in Figure .6 in Supplementary Material.
  • Figure 5: Study of incorrect term usage per filtering method using a human-annotated corpus of Instagram posts. The False Positive Ratio (FPR) is the proportion of annotated terms from a given cohort (retained or not retained) that are used in contexts unrelated to biomedical inference. The $p$-value above each pair of bars indicates the statistical significance of the difference in FPR between retained and not retained users, where $**$ indicates $p < 0.01$ and $*$ indicates $p < 0.05$. The 95% confidence intervals for the difference in FPR between retained and not retained users are 11.4% $\pm$ 9.1%, 4.0% $\pm$ 3.3%, and 4.3% $\pm$ 3.3% for the backbone, lenient, and aggressive filters, respectively. The bottom horizontal bars present the number of retained and not retained users for each filter.
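
The FPR comparison in Figure 5 can be illustrated with a short sketch. The caption does not say how the confidence intervals were computed, so this example assumes a normal-approximation (Wald) interval for the difference between two proportions. The counts below are hypothetical and are not taken from the paper.

```python
import math


def false_positive_ratio(misused, annotated):
    """Share of annotated term uses that are unrelated to biomedical inference."""
    if annotated <= 0:
        raise ValueError("annotated must be positive")
    return misused / annotated


def fpr_difference_ci(misused_a, total_a, misused_b, total_b, z=1.96):
    """Return (difference, half_width) for FPR_a - FPR_b.

    Assumption: Wald interval; z = 1.96 corresponds to a 95% confidence level.
    """
    fpr_a = false_positive_ratio(misused_a, total_a)
    fpr_b = false_positive_ratio(misused_b, total_b)
    standard_error = math.sqrt(
        fpr_a * (1 - fpr_a) / total_a + fpr_b * (1 - fpr_b) / total_b
    )
    return fpr_a - fpr_b, z * standard_error


# Hypothetical annotation counts: not-retained users (a) vs. retained users (b).
difference, half_width = fpr_difference_ci(30, 100, 20, 100)
print(f"FPR difference: {difference:.1%} ± {half_width:.1%}")
```

An interval that excludes zero under this approximation would suggest that the filter separates misused terms from biomedically relevant ones; with small annotated samples, an exact or bootstrap interval may be a safer choice.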