Towards Algorithmic Fidelity: Mental Health Representation across Demographics in Synthetic vs. Human-generated Data
Shinka Mori, Oana Ignat, Andrew Lee, Rada Mihalcea
TL;DR
This work probes how synthetic depression data generated by GPT-3 reflects or distorts demographic patterns. It introduces HEADROOM, a synthetic dataset of 3,120 posts controlled for race, gender, and time frame (before and after COVID-19), and compares it to the human-generated UMD-ODH dataset. Through semantic analysis with Structural Topic Models and lexical analysis using LIWC-based log-odds, the study characterizes predominant stressors by demographic group and assesses cross-dataset fidelity, finding that synthetic data captures many real-world stressor patterns while also revealing additional, potentially model-induced topics. The results demonstrate partial algorithmic fidelity between synthetic and human data and provide methodological procedures, prompt-tuning practices, and analytic pipelines to probe biases in LLM-generated mental health data. The work highlights both the utility and risks of using synthetic data in sensitive domains, offering a reproducible framework and open resources to benchmark demographic representations in LLM-generated mental health content for research and system evaluation.
Abstract
Synthetic data generation has the potential to impact applications and domains with scarce data. However, before such data is used for sensitive tasks such as mental health, we need an understanding of how different demographics are represented in it. In our paper, we analyze the potential of producing synthetic data using GPT-3 by exploring the various stressors it attributes to different race and gender combinations, to provide insight for future researchers looking into using LLMs for data generation. Using GPT-3, we develop HEADROOM, a synthetic dataset of 3,120 posts about depression-triggering stressors, by controlling for race, gender, and time frame (before and after COVID-19). Using this dataset, we conduct semantic and lexical analyses to (1) identify the predominant stressors for each demographic group; and (2) compare our synthetic data to a human-generated dataset. We present the procedures to generate queries to develop depression data using GPT-3, and conduct analyses to uncover the types of stressors it assigns to demographic groups, which could be used to test the limitations of LLMs for synthetic data generation for depression data. Our findings show that synthetic data mimics some of the human-generated data distribution for the predominant depression stressors across diverse demographics.
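As a rough illustration of the kind of lexical comparison described above, the sketch below computes a smoothed log-odds ratio for each word category between two corpora. It is a minimal sketch under stated assumptions, not the paper's pipeline: the function name `smoothed_log_odds`, the additive smoothing constant `alpha`, and the toy category names and counts are all hypothetical and are not LIWC categories or results from the study.

```python
import math


def smoothed_log_odds(counts_a, counts_b, alpha=1.0):
    """Log-odds ratio of each category in corpus A versus corpus B.

    Additive smoothing (alpha) keeps categories that are missing from one
    corpus from producing log(0). Assumes at least two categories overall.
    """
    categories = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values()) + alpha * len(categories)
    total_b = sum(counts_b.values()) + alpha * len(categories)
    scores = {}
    for cat in categories:
        a = counts_a.get(cat, 0) + alpha
        b = counts_b.get(cat, 0) + alpha
        odds_a = a / (total_a - a)  # odds of this category in corpus A
        odds_b = b / (total_b - b)  # odds of this category in corpus B
        scores[cat] = math.log(odds_a) - math.log(odds_b)
    return scores


# Toy counts (hypothetical, for illustration only).
synthetic_counts = {"work": 30, "family": 10, "money": 10}
human_counts = {"work": 10, "family": 30, "money": 10}

scores = smoothed_log_odds(synthetic_counts, human_counts)
for category, score in scores.items():
    print(f"{category}: {score:+.3f}")
```

A positive score means the category is relatively more frequent in the first corpus (here, the toy synthetic counts), a negative score means it is more frequent in the second, and a score near zero means the two corpora use it at similar rates.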
