DS@GT eRisk 2024: Sentence Transformers for Social Media Risk Assessment

David Guecha; Aaryan Potdar; Anthony Miyaguchi

TL;DR

The paper investigates two eRisk 2024 tasks: Task 1, ranking sentences by relevance to depression symptoms from the BDI-II questionnaire, and Task 3, estimating eating disorder severity from Reddit post histories. For Task 1, it compares binary relevance classifiers against sentence-transformer-based representations: calibration issues hinder ranking with the classifiers, while embedding-based approaches improve performance but still need refinement. For Task 3, BERT-based embeddings combined with classical ML models yield competitive results; Random Forest performs best on the high-dimensional embeddings, and Extra Trees performs best after dimensionality reduction. Overall, sentence transformers prove effective for text representation across both tasks. The work underscores the importance of representation choices for early social-media-based risk assessment and provides code and models at the referenced GitHub repository.

Abstract

We present working notes for the DS@GT team's participation in eRisk 2024, Tasks 1 and 3. For Task 1, we propose a ranking system that predicts symptoms of depression based on the Beck Depression Inventory (BDI-II) questionnaire, using binary classifiers trained on question relevancy as a proxy for ranking. We find that binary classifiers are not well calibrated for ranking and perform poorly during evaluation. For Task 3, we use embeddings from BERT to predict the severity of eating disorder symptoms based on user post history. We find that classical machine learning models perform well on the task and end up competitive with the baseline models. Representation of text data is crucial in both tasks, and we find that sentence transformers are a powerful tool for downstream modeling. Source code and models are available at https://github.com/dsgt-kaggle-clef/erisk-2024.
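To make the embedding-based ranking idea for Task 1 concrete, the following is a minimal sketch, not the authors' implementation. It assumes each sentence and each BDI-II question has already been mapped to a fixed-length vector by some sentence-transformer model, and it assumes cosine similarity as the ranking score; the abstract does not state which scoring function was used. The toy vectors and the names `cosine` and `rank_sentences` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sentences(query_vec, sentence_vecs, top_k=None):
    """Return (sentence_id, score) pairs sorted by descending similarity to the query."""
    scored = [(sid, cosine(query_vec, vec)) for sid, vec in sentence_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy 3-dimensional "embeddings" standing in for real model output (assumption).
query = [1.0, 0.0, 0.5]  # hypothetical embedding of one BDI-II question
sentences = {
    "s1": [0.9, 0.1, 0.4],
    "s2": [0.0, 1.0, 0.0],
    "s3": [0.5, 0.5, 0.5],
}
ranking = rank_sentences(query, sentences)
```

Unlike a classifier probability, a similarity score only needs to order sentences correctly, so it sidesteps the calibration problem the abstract describes, at the cost of not using the relevance labels directly.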

Paper Structure (17 sections, 4 figures, 6 tables)

This paper contains 17 sections, 4 figures, 6 tables.

Introduction
Related Work
Task 1: Search for Symptoms of Depression
Dataset
Methodology
Preprocessing
Modeling
Results
Discussion and Future Work
Task 3: Measuring the Severity of the Signs of Eating Disorders
Dataset
Methodology
Preprocessing
Modeling
Results
...and 2 more sections

Figures (4)

Figure 1: Example post in a TREC document. The DOCNO field contains the document number, and the TEXT field contains the post content.
Figure 2: The modeling pipeline for Task 1 using sentence transformers. Binary relevance labels train a classifier that ranks documents based on their relevance to the BDI-II questionnaire. These relevance predictions help filter documents to limit transformation computation with a sentence transformer model. The final model ranks the documents based on their relevance to the BDI-II questionnaire.
Figure 3: A diagram of the Task 3 pipeline. Each user's post history is fed into a BERT model to generate embeddings. The embeddings are then fed into a machine-learning model to predict the EDE-Q responses.
Figure 4: Task 3 model performance on high-dimensional vector embeddings and after dimensionality reduction.
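The Task 3 pipeline in Figure 3 turns each user's post history into embeddings before a classical model predicts the EDE-Q responses. The sketch below illustrates one possible aggregation step under stated assumptions: the caption does not say how per-post embeddings are combined into a single per-user feature vector, so mean pooling is assumed here purely for illustration. The toy vectors and the name `mean_pool` are hypothetical.

```python
def mean_pool(post_vecs):
    """Average a non-empty list of equal-length post embeddings into one user vector."""
    if not post_vecs:
        raise ValueError("a user needs at least one post embedding")
    dim = len(post_vecs[0])
    count = len(post_vecs)
    return [sum(vec[i] for vec in post_vecs) / count for i in range(dim)]


# Toy 2-dimensional post embeddings standing in for BERT output (assumption).
user_posts = {
    "user_a": [[1.0, 2.0], [3.0, 4.0]],
    "user_b": [[0.0, 1.0]],
}

# One fixed-length feature vector per user, regardless of how many posts they wrote.
user_features = {user: mean_pool(vecs) for user, vecs in user_posts.items()}
```

The resulting per-user vectors are the kind of input a tree-based model such as the Random Forest or Extra Trees models named in the TL;DR could consume, with or without a dimensionality reduction step in between.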
