Towards Algorithmic Fidelity: Mental Health Representation across Demographics in Synthetic vs. Human-generated Data

Shinka Mori, Oana Ignat, Andrew Lee, Rada Mihalcea

TL;DR

This work probes how synthetic depression data generated by GPT-3 reflects or distorts demographic patterns. It introduces HeadRoom, a synthetic dataset of 3,120 posts controlled for race, gender, and time frame (before and after COVID-19), and compares it to the human-generated UMD-ODH dataset. Using semantic analysis with Structural Topic Models and lexical analysis with LIWC-based log-odds, the study characterizes the predominant stressors for each demographic group and assesses cross-dataset fidelity. It finds that the synthetic data captures many real-world stressor patterns while also containing additional, potentially model-induced topics. The results indicate partial algorithmic fidelity between synthetic and human data, and the paper provides methodological procedures, prompt-tuning practices, and analytic pipelines for probing biases in LLM-generated mental health data. The work highlights both the utility and the risks of using synthetic data in sensitive domains, offering a reproducible framework and open resources for benchmarking demographic representations in LLM-generated mental health content.

Abstract

Synthetic data generation has the potential to impact applications and domains with scarce data. However, before such data is used for sensitive tasks such as mental health, we need an understanding of how different demographics are represented in it. In our paper, we analyze the potential of producing synthetic data using GPT-3 by exploring the various stressors it attributes to different race and gender combinations, to provide insight for future researchers looking into using LLMs for data generation. Using GPT-3, we develop HEADROOM, a synthetic dataset of 3,120 posts about depression-triggering stressors, by controlling for race, gender, and time frame (before and after COVID-19). Using this dataset, we conduct semantic and lexical analyses to (1) identify the predominant stressors for each demographic group; and (2) compare our synthetic data to a human-generated dataset. We present the procedures for generating queries that produce depression data with GPT-3, and conduct analyses to uncover the types of stressors it assigns to demographic groups, which could be used to test the limitations of LLMs for synthetic depression data generation. Our findings show that synthetic data mimics some of the human-generated data distribution for the predominant depression stressors across diverse demographics.

Paper Structure (37 sections, 3 figures, 12 tables)

Figures (3)

  • Figure 1: Topic Modeling: topic proportions at the intersection of race and gender -- African American women vs. African American men. The bars represent confidence intervals. The closer a topic is to a graph extremity, the more prevalent it is for the corresponding demographic.
  • Figure 2: Topic Modeling: topic proportions across demographics for topics detected in both GPT-generated data and real-life data. Colors represent different races and genders: Men -- purple, Women -- orange, Asian -- magenta, African American -- green, Hispanic -- blue, and White -- red. The bars represent confidence intervals. The closer a topic is to a graph extremity, the more prevalent it is for the corresponding demographic. For example, graph (a), Asian vs. African American, shows that stressors such as work1/work-fatigue, work2/work-pressure, and school are more prevalent for the Asian group than for the African American group. Best viewed in color.
  • Figure 3: Topic Modeling: topic proportions across demographics for topics detected in GPT-generated data but not in real-life data. Colors represent different races and genders: Men -- purple, Women -- orange, Asian -- magenta, African American -- green, Hispanic -- blue, and White -- red. The bars represent confidence intervals. The closer a topic is to a graph extremity, the more prevalent it is for the corresponding demographic. Best viewed in color.