Refinement of an Epilepsy Dictionary through Human Annotation of Health-related posts on Instagram

Aehong Min; Xuan Wang; Rion Brattig Correia; Jordan Rozum; Wendy R. Miller; Luis M. Rocha

TL;DR

This study shows that biomedical dictionaries derived from scientific discourse can misinterpret informal social media language in epilepsy contexts. It introduces a human-centered refinement workflow in which annotators label 1,771 Instagram posts to identify false positives, and it removes eight terms that combine a high false-positive rate with a high frequency. Network analyses of co-mentions using the original versus refined dictionaries reveal substantial shifts in the top-ranked terms toward more clinically relevant epilepsy concepts, quantified via $K$ (Fagin's generalized Kendall's distance) and CER (common-element ratio). The authors also evaluate OpenAI's GPT-4 against human annotators, finding substantial disagreement and limited capability to replace them, which suggests a hybrid approach and dataset-tailored dictionaries for reliable social-media biomedical surveillance. Overall, the work highlights the importance of human validation in dictionary curation to improve downstream knowledge graphs and disease-specific inferences from noisy social media data.

Abstract

We used a dictionary built from biomedical terminology extracted from sources such as DrugBank, MedDRA, MedlinePlus, and TCMGeneDIT to tag more than 8 million Instagram posts made between 2010 and early 2016 by users who mentioned an epilepsy-relevant drug at least once. A random sample of 1,771 posts with 2,947 term matches was evaluated by human annotators to identify false positives, and OpenAI's GPT series models were compared against the human annotation. Analysis of the estimated false-positive rates of the annotated terms revealed 8 ambiguous terms (plus synonyms) that are frequent in Instagram posts and have a high false-positive rate; these were removed from the original dictionary. To study the effect of removing those terms, we constructed knowledge networks using the refined and the original dictionaries and performed an eigenvector-centrality analysis on both networks. We show that the refined dictionary leads to a significantly different ranking of important terms, as measured by their eigenvector centrality in the knowledge networks. Furthermore, the most important terms obtained after refinement are of greater medical relevance. In addition, we show that OpenAI's GPT series models fare worse than human annotators in this task.
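The term-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the `TermStats` fields, the function names, the 0.5 rate threshold (borrowed from the dashed line in Figure 2), the `min_frequency` cutoff, and the toy counts are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TermStats:
    term: str
    matches: int           # annotated matches in the sampled posts
    false_positives: int   # matches the annotators judged to be false positives
    corpus_frequency: int  # matches in the full tagged corpus


def false_positive_rate(stats: TermStats) -> float:
    """Estimated false-positive rate of a term in the annotated sample."""
    return stats.false_positives / stats.matches if stats.matches else 0.0


def select_ambiguous(stats_list, fpr_threshold=0.5, min_frequency=10_000):
    """Return terms that are both frequent and mostly false positives.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    return [
        s.term
        for s in stats_list
        if false_positive_rate(s) > fpr_threshold
        and s.corpus_frequency >= min_frequency
    ]


# Toy data with placeholder term names (not terms from the study).
toy = [
    TermStats("term_a", matches=40, false_positives=30, corpus_frequency=50_000),
    TermStats("term_b", matches=20, false_positives=2, corpus_frequency=80_000),
    TermStats("term_c", matches=10, false_positives=9, corpus_frequency=500),
]
selected = select_ambiguous(toy)
```

In this toy input only `term_a` is selected: `term_b` is frequent but rarely a false positive, and `term_c` is mostly a false positive but too rare to matter.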

Paper Structure (27 sections, 5 figures, 25 tables)

Introduction
Data and Methods
Dictionary Construction
Data Collection and Post Tagging
Results
Human-centered Annotation
Identifying Ambiguous Terms
Impact of Removing Ambiguous Terms
Comparing Social Media with Medical and Scientific Discourse
The disagreement between GPT-4 and human annotators is significant
Discussion
Conclusion
Acknowledgement
Annotation process & analysis example
Dictionary refinement impact on knowledge networks
...and 12 more sections

Figures (5)

Figure 1: Manual annotation workflow
Figure 2: False-positive rate & frequency of parent terms in the annotated sample of posts. The dashed horizontal line depicts a false-positive rate of 50%.
Figure 3: Impact of removing the 8 selected terms on the top $k$ ($10 \leq k \leq 500$) terms by eigenvector centrality, measured as the common-element ratio between the top-$k$ term lists before and after term removal. For example, if 5 terms are still in the top-10 list after term removal, the common-element ratio is $0.5$ for $k=10$. The top-$k$ eigenvector-centrality term lists were calculated on the Instagram co-mention network with and without term removal. For baseline comparison, we generated 1,000 samples ($N=8$) from terms with a false-positive rate less than 0.5 and a frequency greater than or equal to 10,230, the minimum frequency of the 8 selected terms in our Instagram data corpus.
Figure S1: Screenshot of annotation guidelines
Figure S2: Screenshot of annotation analysis
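The common-element ratio used in the Figure 3 caption can be illustrated with a short sketch. The function names, the tie-breaking rule, and the toy centrality scores below are assumptions for the example; in the paper the scores come from eigenvector centrality on the co-mention networks.

```python
def top_k(centrality, k):
    """Return the k terms with the highest centrality score.

    Ties are broken alphabetically, an arbitrary choice made for this sketch.
    """
    ranked = sorted(centrality, key=lambda term: (-centrality[term], term))
    return ranked[:k]


def common_element_ratio(before, after, k):
    """Fraction of the top-k terms shared by two ranked lists."""
    return len(set(before[:k]) & set(after[:k])) / k


# Toy ranked lists with placeholder names: 5 of the top 10 terms survive.
before = [f"t{i}" for i in range(10)]
after = ["t0", "t1", "t2", "t3", "t4", "u0", "u1", "u2", "u3", "u4"]
ratio = common_element_ratio(before, after, k=10)

# Toy centrality scores to show how a ranked list could be derived.
scores = {"seizure": 0.9, "drug_x": 0.4, "term_y": 0.7}
ranked = top_k(scores, k=2)
```

With these toy lists `ratio` is 0.5, matching the worked example in the caption for $k=10$.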
