Enhanced Labeling Technique for Reddit Text and Fine-Tuned Longformer Models for Classifying Depression Severity in English and Luganda

Richard Kimera, Daniela N. Rim, Joseph Kirabira, Ubong Godwin Udomah, Heeyoul Choi

TL;DR

This work addresses depression severity classification from Reddit text. Its labeling pipeline combines keyword-based extraction aligned with the Beck Depression Inventory (BDI), a context-aware BART labeling step, and expert input; a Longformer is then fine-tuned to classify English and Luganda text. The dataset comprises 1,807 Reddit sentences labeled through synthetic labeling, expert annotation, and a weighted majority voting scheme, yielding six target classes that are later consolidated to four because of data sparsity. The fine-tuned Longformer outperforms the classical baselines in both languages (English: 48% accuracy; Luganda: 45%), though results are limited by the small dataset and by translation quality. The work demonstrates the feasibility of multilingual depression severity detection on social media and points to larger datasets and linguistically informed Luganda translation as likely sources of improvement.

Abstract

Depression is a global burden and one of the most challenging mental health conditions to control. Experts can detect its severity early using the Beck Depression Inventory (BDI) questionnaire, administer appropriate medication to patients, and impede its progression. Due to the fear of potential stigmatization, many patients turn to social media platforms like Reddit for advice and assistance at various stages of their journey. This research extracts text from Reddit to facilitate the diagnostic process. It employs a proposed labeling approach to categorize the text and subsequently fine-tunes the Longformer model. The model's performance is compared against baseline models, including Naive Bayes, Random Forest, Support Vector Machines, and Gradient Boosting. Our findings reveal that the Longformer model outperforms the baseline models in both English (48%) and Luganda (45%) languages on a custom-made dataset.
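
The weighted majority voting step named in the TL;DR can be illustrated with a minimal Python sketch. It is not the authors' implementation: the source names (`keyword`, `bart`, `expert`), the weights, and the severity labels are hypothetical placeholders chosen for illustration, since the summary does not specify them.

```python
from collections import defaultdict

# Hypothetical weights per labeling source (assumed, not from the paper):
# expert annotation is trusted most, keyword matching least.
SOURCE_WEIGHTS = {"keyword": 1.0, "bart": 1.5, "expert": 2.0}


def weighted_majority_vote(votes, weights=SOURCE_WEIGHTS):
    """Return the label with the largest summed weight.

    votes   -- dict mapping a source name to the label it proposed
    weights -- dict mapping a source name to its voting weight
    Ties are broken alphabetically so the result is deterministic.
    """
    totals = defaultdict(float)
    for source, label in votes.items():
        totals[label] += weights[source]
    # max() keeps the first maximal item, so sorting first gives an
    # alphabetical tie-break.
    return max(sorted(totals), key=lambda label: totals[label])


# Placeholder severity labels; the paper's actual class names may differ.
agree = weighted_majority_vote(
    {"keyword": "mild", "bart": "moderate", "expert": "moderate"}
)
split = weighted_majority_vote(
    {"keyword": "mild", "bart": "moderate", "expert": "mild"}
)
print(agree, split)
```

In the first call two sources agree on one label, so it wins outright. In the second call the combined weight of the keyword and expert votes (3.0) exceeds the single BART vote (1.5), which shows how weighting lets a trusted source tip a split decision.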

Paper Structure (7 sections, 3 tables)