Social Media as a Sensor: Analyzing Twitter Data for Breast Cancer Medication Effects Using Natural Language Processing

Seibi Kobara, Alireza Rafiei, Masoud Nateghi, Selen Bozkurt, Rishikesan Kamaleswaran, Abeed Sarker

TL;DR

The paper addresses the paucity of patient-reported outcome data in EHRs and the potential of social media to illuminate breast cancer medication experiences. It develops an NLP pipeline that uses a transformer classifier to automatically identify self-reported breast cancer posts on X/Twitter, then applies a two-layer rule-based system with Levenshtein matching to extract medication mentions and side effects. On a dataset of $1{,}454{,}637$ posts from $583{,}962$ users, the approach identifies $62{,}042$ breast cancer members, with $198$ mentioning FDA-approved medications (tamoxifen most common) and a side-effect lexicon uncovering $31$ side effects, including a novel generalized emotion category. Statistical analyses reveal significant associations between medication patterns and side effects, supporting the feasibility of social-media–driven pharmacovigilance and early signal detection, and suggesting directions for scaling to larger cohorts and additional platforms.

Abstract

Breast cancer is a significant public health concern and is the leading cause of cancer-related deaths among women. Despite advances in breast cancer treatments, medication non-adherence remains a major problem. As electronic health records do not typically capture patient-reported outcomes that may reveal information about medication-related experiences, social media presents an attractive resource for enhancing our understanding of patients' treatment experiences. In this paper, we developed natural language processing (NLP) based methodologies to study information posted by an automatically curated breast cancer cohort from social media. We employed a transformer-based classifier to identify breast cancer patients/survivors on X (Twitter) based on their self-reported information, and we collected longitudinal data from their profiles. We then designed a multi-layer rule-based model to develop a breast cancer therapy-associated side effect lexicon and detect patterns of medication usage and associated side effects among breast cancer patients. A total of 1,454,637 posts were available from 583,962 unique users, of whom 62,042 were detected as breast cancer members using our transformer-based model. Of these, 198 cohort members mentioned breast cancer medications, with tamoxifen the most common. Our side effect lexicon identified well-known side effects of hormone therapy and chemotherapy. Furthermore, it discovered subjective feelings towards cancer and medications, which may suggest a pre-clinical phase of side effects or emotional distress. This analysis highlighted not only the utility of NLP techniques for identifying self-reported breast cancer posts, medication usage patterns, and treatment side effects in unstructured social media data, but also the richness of social media data for such clinical questions.
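
The TL;DR mentions Levenshtein matching for extracting medication mentions, which helps catch misspelled drug names in informal posts. The following is a minimal sketch of that idea, not the authors' implementation: the lexicon `MEDICATIONS`, the function names, the tokenization, and the `max_distance` threshold of 1 are all assumptions made for illustration.

```python
import re

# Hypothetical mini-lexicon for illustration; the paper's actual
# medication list is not reproduced here.
MEDICATIONS = ["tamoxifen", "letrozole", "anastrozole"]


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def find_medications(post: str, lexicon=MEDICATIONS, max_distance: int = 1) -> list[str]:
    """Return lexicon entries that some token in the post matches
    within max_distance edits (an assumed threshold)."""
    tokens = re.findall(r"[a-z]+", post.lower())
    return [
        med for med in lexicon
        if any(levenshtein(tok, med) <= max_distance for tok in tokens)
    ]


print(find_medications("Started tamoxifin last week, hot flashes already"))
# → ['tamoxifen']
```

A small threshold keeps common misspellings such as "tamoxifin" while limiting false matches against unrelated words; a real system would likely tune the threshold per drug-name length.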

Paper Structure (10 sections, 4 figures)

Figures (4)

  • Figure 1: Flow diagram of the methods for discovering medications and their associated side effects from the social media cohort data. Abbreviations: RBM, rule-based model.
  • Figure 2: Top 10 most frequently mentioned medications approved for breast cancer in our social media cohort. The label above each bar shows the number of cohort members who mentioned the medication.
  • Figure 3: Expressed side effects in our social media cohort. The y-axis represents the proportions, and the text labels above the bars give the number of users who expressed each side effect. NEC, not elsewhere classified.
  • Figure 4: The heat map of prevalence of significantly associated side effects with medication patterns (adjusted p-value $<$ 0.05). NEC, not elsewhere classified.