DS@GT eRisk 2024: Sentence Transformers for Social Media Risk Assessment
David Guecha, Aaryan Potdar, Anthony Miyaguchi
TL;DR
The paper investigates two eRisk 2024 tasks: Task 1, depression symptom ranking based on the BDI-II questionnaire, and Task 3, eating disorder severity estimation from Reddit post histories. For Task 1, it compares binary relevance classifiers against sentence-transformer-based representations and finds that poor calibration hinders ranking, while embedding-based approaches improve performance but still need refinement. For Task 3, BERT-based embeddings combined with classical ML models yield competitive results: Random Forest performs best in the high-dimensional space, and Extra Trees performs best after dimensionality reduction. Across both tasks, sentence transformers prove effective for text representation. The work underscores the importance of representation choices for early social-media-based risk assessment and provides code and models at the referenced GitHub repository.
Abstract
We present working notes for the DS@GT team's participation in Tasks 1 and 3 of eRisk 2024. For Task 1, we propose a ranking system that predicts symptoms of depression based on the Beck Depression Inventory (BDI-II) questionnaire, using binary classifiers trained on question relevancy as a proxy for ranking. We find that binary classifiers are not well calibrated for ranking and perform poorly during evaluation. For Task 3, we use embeddings from BERT to predict the severity of eating disorder symptoms based on user post history. We find that classical machine learning models perform well on the task and are competitive with the baseline models. Representation of text data is crucial in both tasks, and we find that sentence transformers are a powerful tool for downstream modeling. Source code and models are available at https://github.com/dsgt-kaggle-clef/erisk-2024.
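As a rough illustration of the Task 1 idea described above (a binary relevance classifier whose predicted probability serves as a ranking score), the following minimal sketch trains a logistic regression on toy two-dimensional vectors and sorts candidate sentences by score. It is not the authors' implementation: the toy vectors stand in for sentence-transformer embeddings, and the function names, the choice of logistic regression, and the training settings are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, vec):
    """Predicted probability that a vector is relevant."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, vec)) + b)


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit a binary relevance classifier with batch gradient descent."""
    dim = len(X[0])
    w, b, n = [0.0] * dim, 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = score(w, b, xi) - yi  # gradient of log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def rank_by_relevance(w, b, candidates):
    """Return candidate ids sorted from most to least relevant."""
    return sorted(candidates, key=lambda sid: score(w, b, candidates[sid]), reverse=True)


# Hypothetical toy "embeddings": label 1 = relevant to a question, 0 = not relevant.
X_train = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8], [0.3, 0.6]]
y_train = [1, 1, 1, 0, 0, 0]

# Hypothetical candidate sentences to rank for that question.
candidates = {"s1": [0.85, 0.1], "s2": [0.5, 0.5], "s3": [0.1, 0.95]}

w, b = train_logistic(X_train, y_train)
ranking = rank_by_relevance(w, b, candidates)
print(ranking)
```

Sorting uses only the relative order of the scores for a single classifier. One plausible reading of the calibration issue reported in the abstract is that probabilities from separately trained classifiers are not directly comparable, so a good classification boundary does not by itself guarantee a good ranking.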
