Datasets for Depression Modeling in Social Media: An Overview
Ana-Maria Bucur, Andreea-Codrina Moldovan, Krutika Parvatikar, Marcos Zampieri, Ashiqur R. KhudaBukhsh, Liviu P. Dinu
TL;DR
The paper addresses the challenge of locating and maintaining accessible depression-related datasets derived from social media amid shifting platform policies. It performs a systematic literature review (2019–2024) and compiles 59 data collections, complementing 310 identified papers, to provide an up-to-date dataset resource. It finds that the LT-EDI/DepSign, eRisk, and CLPsych datasets dominate benchmarks, and it discusses annotation strategies, data availability, and reliability concerns. The work offers a continuously updated resource to support interdisciplinary research, while outlining limitations, ethical considerations, and directions for future work, including non-English data expansion and multi-task learning approaches.
Abstract
Depression is the most common mental health disorder, and its prevalence increased during the COVID-19 pandemic. Depression is also one of the most extensively researched psychological conditions, and recent research has increasingly focused on leveraging social media data to enhance traditional methods of depression screening. This paper addresses the growing interest in interdisciplinary research on depression and aims to support early-career researchers by providing a comprehensive and up-to-date list of datasets for analyzing and predicting depression through social media data. We present an overview of datasets published between 2019 and 2024. We also make the full list of datasets available online as a continuously updated resource, in the hope that it will facilitate further interdisciplinary research into the linguistic expressions of depression on social media.
