Synthetic Data Generation with LLM for Improved Depression Prediction

Andrea Kang, Jun Yu Chen, Zoe Lee-Youngzie, Shuhao Fu

TL;DR

This paper proposes a pipeline for Large Language Models to generate synthetic data to improve the performance of depression prediction models and demonstrates a novel approach to addressing data scarcity and privacy concerns commonly faced in automatic depression detection.

Abstract

Automatic detection of depression is a rapidly growing field of research at the intersection of psychology and machine learning. However, this rapid growth in interest brings growing concerns about data privacy and scarcity, given the sensitivity of the topic. In this paper, we propose a pipeline for Large Language Models (LLMs) to generate synthetic data to improve the performance of depression prediction models. Starting from unstructured, naturalistic text data from recorded transcripts of clinical interviews, we utilize an open-source LLM to generate synthetic data through chain-of-thought prompting. This pipeline involves two key steps: the first generates a synopsis and sentiment analysis from the original transcript and its depression score, while the second generates a synthetic synopsis and sentiment analysis from the summaries produced in the first step and a new depression score. The synthetic data was not only satisfactory in terms of fidelity and privacy-preserving metrics, but also balanced the distribution of severity in the training dataset, thereby significantly enhancing the model's capability in predicting the intensity of the patient's depression. By leveraging LLMs to generate synthetic data that can augment limited and imbalanced real-world datasets, we demonstrate a novel approach to addressing data scarcity and privacy concerns commonly faced in automatic depression detection, all while maintaining the statistical integrity of the original dataset. This approach offers a robust framework for future mental health research and applications.
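
The two-step pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `generate` callable is a hypothetical stand-in for whatever LLM is used, the prompt wording is invented for this sketch (the paper's actual prompt is not reproduced here), and the names `Summary`, `summarize`, and `synthesize` are assumptions. The only external fact relied on is that PHQ-8 totals range from 0 to 24.

```python
from dataclasses import dataclass
from typing import Callable

# PHQ-8 totals range from 0 to 24 (eight items, each scored 0-3).
PHQ8_MIN, PHQ8_MAX = 0, 24

# Hypothetical stand-in for an LLM call: prompt text in, completion text out.
Generate = Callable[[str], str]


@dataclass(frozen=True)
class Summary:
    synopsis: str
    sentiment: str
    phq8: int


def check_score(score: int) -> None:
    """Reject scores outside the valid PHQ-8 range."""
    if not PHQ8_MIN <= score <= PHQ8_MAX:
        raise ValueError(
            f"PHQ-8 score must be in [{PHQ8_MIN}, {PHQ8_MAX}], got {score}"
        )


def summarize(transcript: str, phq8: int, generate: Generate) -> Summary:
    """Step 1: synopsis and sentiment analysis from a real transcript and its score."""
    check_score(phq8)
    context = f"Interview transcript:\n{transcript}\n\nPHQ-8 score: {phq8}\n"
    # Illustrative prompt wording only (assumption).
    synopsis = generate(context + "Write a short synopsis of the interview.")
    sentiment = generate(context + "Describe the participant's sentiment.")
    return Summary(synopsis, sentiment, phq8)


def synthesize(original: Summary, new_phq8: int, generate: Generate) -> Summary:
    """Step 2: synthetic synopsis and sentiment from the step-1 summaries and a new score."""
    check_score(new_phq8)
    context = (
        f"Original synopsis:\n{original.synopsis}\n\n"
        f"Original sentiment analysis:\n{original.sentiment}\n\n"
        f"Original PHQ-8 score: {original.phq8}\n"
        f"New PHQ-8 score: {new_phq8}\n"
    )
    # Illustrative prompt wording only (assumption).
    synopsis = generate(
        context + "Rewrite the synopsis to match the new score, keeping factual details."
    )
    sentiment = generate(
        context + "Rewrite the sentiment analysis to match the new score."
    )
    return Summary(synopsis, sentiment, new_phq8)
```

Passing the model in as a callable keeps the sketch independent of any particular inference library. Note that step 2 only ever sees the step-1 summaries, never the raw transcript, which mirrors the description above.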

Paper Structure

This paper contains 18 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A flowchart of the Chain-of-Thought pipeline. The LLM was able to capture the transcript's key moments in both the synopsis and sentiment analysis. Once provided with the higher PHQ-8 score, the LLM could generate a new synopsis and sentiment analysis that maintain the original details (in green) while portraying a more depressed character (in red).
  • Figure 2: Data distribution for different PHQ-8 scores in the original (left) and combined datasets (right).
  • Figure 3: The current prompt used for generating a synthetic transcript in Llama 3.2.
  • Figure 4: PCA visualization of paragraph embeddings from the original and synthetic datasets. Each point represents a synopsis, with original data colored in blue and synthetic data colored in red.