Table of Contents
Fetching ...

A Comparative Study on Patient Language across Therapeutic Domains for Effective Patient Voice Classification in Online Health Discussions

Giorgos Lysandrou, Roma English Owen, Vanja Popovic, Grant Le Brun, Aryo Pradipta Gema, Beatrice Alex, Elizabeth A. L. Fairley

TL;DR

This study analyzes the importance of linguistic characteristics in accurately classifying patient voices and fine-tuned a pre-trained Language Model on the combined datasets with similar linguistic patterns, resulting in a highly accurate automatic patient voice classification.

Abstract

There exists an invisible barrier between healthcare professionals' perception of a patient's clinical experience and the reality. This barrier may be induced by the environment that hinders patients from sharing their experiences openly with healthcare professionals. As patients are observed to discuss and exchange knowledge more candidly on social media, valuable insights can be leveraged from these platforms. However, the abundance of non-patient posts on social media necessitates filtering out such irrelevant content to distinguish the genuine voices of patients, a task we refer to as patient voice classification. In this study, we analyse the importance of linguistic characteristics in accurately classifying patient voices. Our findings underscore the essential role of linguistic and statistical text similarity analysis in identifying common patterns among patient groups. These results allude to even starker differences in the way patients express themselves at a disease level and across various therapeutic domains. Additionally, we fine-tuned a pre-trained Language Model on the combined datasets with similar linguistic patterns, resulting in a highly accurate automatic patient voice classification. Being the pioneering study on the topic, our focus on extracting authentic patient experiences from social media stands as a crucial step towards advancing healthcare standards and fostering a patient-centric approach.

A Comparative Study on Patient Language across Therapeutic Domains for Effective Patient Voice Classification in Online Health Discussions

TL;DR

This study analyzes the importance of linguistic characteristics in accurately classifying patient voices and fine-tuned a pre-trained Language Model on the combined datasets with similar linguistic patterns, resulting in a highly accurate automatic patient voice classification.

Abstract

There exists an invisible barrier between healthcare professionals' perception of a patient's clinical experience and the reality. This barrier may be induced by the environment that hinders patients from sharing their experiences openly with healthcare professionals. As patients are observed to discuss and exchange knowledge more candidly on social media, valuable insights can be leveraged from these platforms. However, the abundance of non-patient posts on social media necessitates filtering out such irrelevant content to distinguish the genuine voices of patients, a task we refer to as patient voice classification. In this study, we analyse the importance of linguistic characteristics in accurately classifying patient voices. Our findings underscore the essential role of linguistic and statistical text similarity analysis in identifying common patterns among patient groups. These results allude to even starker differences in the way patients express themselves at a disease level and across various therapeutic domains. Additionally, we fine-tuned a pre-trained Language Model on the combined datasets with similar linguistic patterns, resulting in a highly accurate automatic patient voice classification. Being the pioneering study on the topic, our focus on extracting authentic patient experiences from social media stands as a crucial step towards advancing healthcare standards and fostering a patient-centric approach.
Paper Structure (29 sections, 2 equations, 4 figures, 5 tables)

This paper contains 29 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Experiment data volumes across data sources and splits, grouped by the therapeutic domains. The labels are "Patient Voice" and "Not Relevant".
  • Figure 2: The pipeline for classifying patient voices from online health communities. Starting with data acquisition from Reddit and SocialGist, the methodology encompasses a sequential process from annotating the data, partitioning into train-validation-test splits, comprehensive linguistic and textual analyses, as well as the training of NLP classifier models.
  • Figure 3: Pairwise comparison matrix of cosine similarity values between all data source and therapeutic domain-specific subsets of the data, after TF-IDF analysis.
  • Figure 4: Heatmaps of McNemar's test p-values for pairwise comparisons of classifiers. Each heatmap corresponds to a specific test dataset. P-values greater than 0.05 suggest no significant difference in classifier performance.