
Reddit-Impacts: A Named Entity Recognition Dataset for Analyzing Clinical and Social Effects of Substance Use Derived from Social Media

Yao Ge, Sudeshna Das, Karen O'Connor, Mohammed Ali Al-Garadi, Graciela Gonzalez-Hernandez, Abeed Sarker

TL;DR

This paper introduces Reddit-Impacts, a specialized NER dataset drawn from Reddit discussions about opioid use to capture clinical and social impacts of nonmedical substance use. It details a rigorous data collection and annotation process across 30 entity types, producing a sparsely occurring but high-signal set of clinical and social impact annotations. The authors benchmark transformer models (BERT, RoBERTa) alongside a few-shot method (DANN) and GPT-3.5 in one-shot settings, highlighting the challenges posed by data sparsity and the relative strengths of DANN and GPT-3.5 in limited-data regimes. The work delivers a valuable resource for public health informatics and NLP research, with implications for automatic extraction of health and societal signals from social media and potential guidance for future model development and real-world deployment.

Abstract

Substance use disorders (SUDs) are a growing concern globally, necessitating enhanced understanding of the problem and its trends through data-driven research. Social media are unique and important sources of information about SUDs, particularly since the data in such sources are often generated by people with lived experiences. In this paper, we introduce Reddit-Impacts, a challenging Named Entity Recognition (NER) dataset curated from subreddits dedicated to discussions on prescription and illicit opioids, as well as medications for opioid use disorder. The dataset specifically concentrates on the lesser-studied, yet critically important, aspects of substance use: its clinical and social impacts. We collected data from chosen subreddits using the publicly available Application Programming Interface for Reddit. We manually annotated text spans representing clinical and social impacts reported by people who also reported personal nonmedical use of substances including but not limited to opioids, stimulants and benzodiazepines. Our objective is to create a resource that can enable the development of systems that can automatically detect clinical and social impacts of substance use from text-based social media data. The successful development of such systems may enable us to better understand how nonmedical use of substances affects individual health and societal dynamics, aiding the development of effective public health strategies. In addition to creating the annotated dataset, we applied several machine learning models to establish baseline performances. Specifically, for automatic NER of clinical and social impacts, we experimented with transformer models such as BERT and RoBERTa, a few-shot learning model (DANN) trained on the full training dataset, and GPT-3.5 with one-shot learning. The dataset has been made available through the 2024 SMM4H shared tasks.
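To make the annotation target concrete, here is a minimal sketch of how span-level annotations of this kind are commonly turned into per-token BIO tags for NER training. It is an illustration under stated assumptions, not the authors' pipeline: the label names `ClinicalImpact` and `SocialImpact`, the helper `spans_to_bio`, and the example sentence are all hypothetical, and the dataset's actual tag set and file format may differ.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label); label names are hypothetical.
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert token-level spans into BIO tags (assumes non-overlapping spans)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span ({start}, {end}) is out of range")
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


# Made-up example sentence, not taken from the dataset.
tokens = "I lost my job after the overdose".split()
spans = [(1, 4, "SocialImpact"), (6, 7, "ClinicalImpact")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

Most tokens receive the `O` tag in a sketch like this, which mirrors the sparsity the paper describes: impact mentions are rare relative to the surrounding text.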

Paper Structure

This paper contains 12 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Entity types and the number of posts in each entity type.