Improving Disease Detection from Social Media Text via Self-Augmentation and Contrastive Learning

Pervaiz Iqbal Khan, Andreas Dengel, Sheraz Ahmed

TL;DR

The paper tackles disease detection from social media text, where figurative language and data sparsity hinder robust representation learning. It introduces a self-augmentation framework with two Transformer streams and a shared projection for contrastive learning, jointly optimizing two classification losses and a contrastive loss that aligns original and augmented representations. Across three public datasets (Dreaddit, RHMD, DepressionEmo), the approach achieves state-of-the-art or near-state-of-the-art results, improving F1 by up to 2.48 percentage points over baselines, with notable gains when combined with RoBERTa; ablation and visualization analyses support these results. The work advances robust, generalizable text representations for health-related NLP, with implications for public health monitoring and disease surveillance from social media data.

Abstract

Detecting diseases from social media has diverse applications, such as public health monitoring and disease spread detection. While language models (LMs) have shown promising performance in this domain, refining their discriminative representations remains an active area of research. In this paper, we propose a novel method that integrates Contrastive Learning (CL) with language modeling to address this challenge. Our approach introduces a self-augmentation method, wherein hidden representations of the model are augmented with their own representations. This method comprises two branches: the first branch, a traditional LM, learns features specific to the given data, while the second branch incorporates augmented representations from the first branch to encourage generalization. CL further refines these representations by pulling pairs of original and augmented versions closer while pushing other samples away. We evaluate our method on three NLP datasets encompassing binary, multi-label, and multi-class classification tasks involving social media posts related to various diseases. Our approach demonstrates notable improvements over traditional fine-tuning methods, achieving up to a 2.48% increase in F1-score compared to baseline approaches and a 2.1% enhancement over state-of-the-art methods.
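
The abstract describes the training objective only in words: two classification losses, plus a contrastive term that pulls each original/augmented pair together and pushes other samples away. The sketch below illustrates one way such an objective could be assembled. It is not the authors' implementation: the InfoNCE-style form of the contrastive loss, the cosine similarity, the `temperature` value, and the `cl_weight` hyperparameter are all assumptions made for illustration, and the self-augmentation step itself is not modeled because this summary does not specify it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(originals, augmented, temperature=0.5):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    For each original embedding, the augmented embedding at the same index
    is treated as the positive; augmented embeddings of the other samples
    in the batch act as negatives.
    """
    total = 0.0
    for i, z in enumerate(originals):
        logits = [cosine(z, a) / temperature for a in augmented]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        # Negative log-probability of picking the matching augmented view.
        total += log_denominator - logits[i]
    return total / len(originals)


def joint_loss(ce_original, ce_augmented, cl, cl_weight=1.0):
    """Sum of the two classification losses and a weighted contrastive term.

    cl_weight is a hypothetical hyperparameter; the source does not say
    how the three terms are weighted.
    """
    return ce_original + ce_augmented + cl_weight * cl


# Toy batch of two 2-d embeddings (illustrative values only).
originals = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each augmented view is close to its original
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each augmented view is close to the other sample

loss_aligned = contrastive_loss(originals, aligned)
loss_swapped = contrastive_loss(originals, swapped)
```

On this toy batch the loss is lower when each augmented view sits near its own original than when the views are swapped, which is the "pull pairs closer, push others away" behavior the abstract refers to. In practice the two classification losses would come from the two branches' classifier heads.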

Paper Structure

This paper contains 22 sections, 2 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: The pipeline of our proposed method, consisting primarily of three components: a transformer model with standard settings, a transformer model with self-augmentation, and contrastive loss.
  • Figure 2: Performance Comparison of the Proposed Approach with Baseline Methods across 3 Datasets.
  • Figure 3: Embedding Visualizations for Dreaddit and DepressionEmo Datasets. The embeddings are visualized for the entire validation set of the Dreaddit dataset, and a subset (for clearer visualization) of the DepressionEmo dataset.
  • Figure 4: Embedding Visualizations for RHMD Dataset.